While I agree with the authors that large language models only trained on text lack the ability to distinguish "possible worlds" from reality, I think there is a path ahead.
Large language models might be excellent candidates for evolutionary methods and RL. They need to learn from solving language problems on a massive scale. But problem solving could be the medicine that cures GPT-3's fuzziness, a bit of symbolic exactness injected into the connectionist system.
In a sense, I thought it was more human-in-the-loop than an explicit RL objective (i.e. a potentially somewhat limited reward surface, even if there's a reward model trained from it).
Interesting how two camps are emerging for LLMs. This one is about how GPT actually learns something; the other, represented by Chomsky and Gary Marcus, argues that GPT has learned nothing (https://news.ycombinator.com/item?id=34278243)
I think the difference is what "learned" means. This paper basically says that any finite amount of knowledge compression is learning, whereas the other camp defines learning as some kind of infinite information compression like being able to add any two numbers no matter how large, which is something no language model will ever be able to do.
Personally, I think both sides are right: GPT-3 has compressed language patterns across the internet, but it's undeniable that GPT-3 makes things up frequently. Overall, the point still stands that these language models are valuable in some specific contexts, but it's not clear how far they can go.
>the other camp defines learning as some kind of infinite information compression like being able to add any two numbers no matter how large, which is something no language model will ever be able to do.
I know this isn't strictly an LLM, but couldn't there be an "extension" where the LLM learns a formula and how to plug values into it (it already seems pretty good at explaining code, so it can arguably do this part), plus something new: the ability to actually perform the calculation, i.e. execute the code/formula or at least "use a calculator" with those values.
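A minimal sketch of that kind of calculator extension, assuming the model can be prompted to emit a bare arithmetic expression (ask_llm_for_expression below is a hypothetical stub, not a real API):

    import ast
    import operator

    # Hypothetical stand-in for an LLM call that turns a word problem
    # into a plain arithmetic expression such as "23 * 7 + 11".
    def ask_llm_for_expression(question):
        return "23 * 7 + 11"  # stubbed response for illustration

    # Safe evaluator for the +, -, *, / expressions the model emits.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expr):
        def walk(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    question = "What is 23 times 7 plus 11?"
    expr = ask_llm_for_expression(question)
    print(question, "->", expr, "=", evaluate(expr))  # ... = 172

The point is just that the LLM only has to produce the formula; the exact arithmetic is handed off to ordinary code.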
The paper addresses the same idea, using the same example. Sure, that can be done. I suppose it's a case of evaluating the prompt and then selecting from various different mechanisms for generating a response, one or several of which might be an LLM. That's interesting, but it's really out of scope for a discussion of LLMs themselves, though.
Humans can't add any two arbitrarily large numbers either, because they eventually make a mistake, which makes the second definition trivially useless for comparing human intelligence with AI.
Would there be any benefit in modeling raw binary sequences rather than tokens?
I think text prediction only gets you so far. But I guess you could use the same principles to predict the next symbol in a binary string. If this binary data represents something like videos of physical phenomena, you might get the AI to reach profound, novel insights about the Universe just from next-bit prediction.
Hmmm, maybe even I could code something like that.
I’m waiting for someone to make a GPT-style model trained for video and audio prediction (e.g. frame by frame, perhaps) in addition to the existing text prediction. Imagine using a significant percentage of YouTube content, for example.
It would probably be insanely expensive. But I feel like it would be almost guaranteed to acquire a world model far richer and more robust than ChatGPT’s.
Human babies learn by watching the world around them. Video frame prediction feels much closer to that than text prediction, and given the wildly impressive results we are seeing with large text prediction models alone, it seems like an obvious next step.
> a significant percentage of YouTube content [...] a world model far richer and more robust than ChatGPT’s
There are three objections to this.
The first is the astonishingly large amount of processing power this would take, given how high-bandwidth video is. The second is that it is hard to believe that something really coherent could emerge from this, and it has certainly never been shown. The third is that the world model might be "richer" in terms of information, but seeing the world through YouTube's eyes would likely be a degrading and incoherent experience.
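For a rough sense of the first objection, here is a back-of-the-envelope comparison of raw video to plain text (the resolution, frame rate, and reading speed are illustrative assumptions, not measurements):

    # Back-of-the-envelope: raw bytes per hour of video vs. plain text.
    width, height, channels = 1280, 720, 3          # assumed 720p RGB frames
    fps = 30
    bytes_per_hour_video = width * height * channels * fps * 3600

    # Assumed: ~300 words per minute of reading, ~6 bytes of ASCII per word.
    bytes_per_hour_text = 300 * 6 * 60

    print(f"raw video:  {bytes_per_hour_video / 1e9:.1f} GB per hour")
    print(f"plain text: {bytes_per_hour_text / 1e3:.1f} KB per hour")
    print(f"ratio: ~{bytes_per_hour_video // bytes_per_hour_text:,}x")

Real pipelines would of course compress and downsample aggressively, but the gap is still several orders of magnitude.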
Google Research has a character-based transformer that learns to tokenize text rather than relying on hand coded tokenizers. It demonstrates superior performance on a variety of LLM tasks.
If you have the money, you can apply the transformer architecture to many different tasks, and people are experimenting all the time. I think one of the big challenges is always coming up with methods for training such enormous models practically, without costs exploding.
Tokenization for models like GPT or BERT can be seen as compression. That is, frequent words are separate tokens. Frequent sequences are separate tokens. On the other hand, if a sequence is very uncommon, then it will contain many tokens.
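A toy illustration of that point, using a made-up vocabulary and greedy longest-match tokenization (real BPE vocabularies are learned from data, but the effect is the same): a frequent word stays one token, while a rare string falls apart into many.

    # Toy greedy longest-match tokenizer over a tiny made-up vocabulary.
    VOCAB = ["hello", "world", "ing", "th", "er"] + [chr(c) for c in range(32, 127)]

    def tokenize(text):
        tokens, i = [], 0
        while i < len(text):
            # take the longest vocabulary entry matching at position i
            best = max((v for v in VOCAB if text.startswith(v, i)), key=len)
            tokens.append(best)
            i += len(best)
        return tokens

    print(tokenize("hello world"))  # ['hello', ' ', 'world'] -- 3 tokens
    print(tokenize("qzx#vk!"))      # one token per character -- 7 tokens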
Sure, you encode bit-by-bit. But it is a fixed-length code, which is even worse than character-by-character.
Maybe you only get worse training and inference time. But I wouldn't be surprised if the encoding also serves as a Bayesian prior, and with a different encoding, you get worse results (for given data).
It's important to remember that the "power" of gpt doesn't come from the model, but from the sheer scale of the dataset. It's trained on the entire internet, in text form. You can 100% use a transformer architecture to train on binary data. But what data do you have hundreds of tebibytes of?
Language also follows very common and repeatable patterns. "Hello" is often followed by "How are you?", etc. Just as Zipf's Law dictates that a few words are used vastly more often than the rest, there are linguistic and conceptual patterns that appear with predictable frequency. If your bits don't follow similar rules, the results might not be as clean.
I'm pretty sure you could code a transformer to work on binary or video data. Sounds like a great github project. But it's unlikely you'll have the scale of data to do anything close to ChatGPT.
The paper cited there, by contrast, argues for select training sets:
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations.
The reason this works for tokens is that tokens are placed in a vector space where similar words end up in similar places. The same effect could not be achieved with characters or bits. If you think about it, our brains also remember words, not characters.
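A minimal sketch of what "similar words end up in similar places" means, with hand-made 3-dimensional vectors standing in for learned embeddings (real models learn vectors with hundreds or thousands of dimensions):

    import math

    # Hand-made toy "embeddings"; real ones are learned, high-dimensional vectors.
    emb = {
        "king":  [0.9, 0.8, 0.1],
        "queen": [0.9, 0.7, 0.2],
        "apple": [0.1, 0.2, 0.9],
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    print(cosine(emb["king"], emb["queen"]))  # ~0.99: similar words, nearby vectors
    print(cosine(emb["king"], emb["apple"]))  # ~0.30: unrelated words, far apart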
Word sounds. I can not read without hearing the word. (Now I wonder about those born deaf.) Based on that subjective experience which I presume is rather universal among the hearing, tokenized phones and phonemes seem promising.
I just tried that. Not sure I like the feeling, seems to rob of the pleasure of reading. But it -is- an interesting effect. Do you get pleasure from it?
I read without hearing the words. If I'm tired, however, sometimes I need to start reading out loud to keep concentrated.
Also, non phonetic writing systems will push you towards not hearing while reading. For example, when I read chinese there are plenty of characters (hanzi) whose pronunciation I don't know (was it hai? Pai?), but I remember the meaning just fine.
Even with English (I'm not a native speaker) sometimes I won't remember how to pronounce a certain word, but I'll read it just fine.
> Also, non phonetic writing systems will push you towards not hearing while reading. For example, when I read chinese there are plenty of characters (hanzi) whose pronunciation I don't know (was it hai? Pai?), but I remember the meaning just fine.
Makes sense. I also tried an earlier suggestion to (effectively) speed read, and that seems to send constant interrupts to the speech center (which is how I am beginning to imagine it works); it struggles to keep up and very soon goes silent. For now I still prefer my apparently pedestrian reading mode, as it has a strong 'pleasure of reading' aspect that I don't currently get when attempting whole-word parsing.
How do you experience the pleasure of reading? For me, the reconstructed voice of the author and, in general, the (for lack of a better word) musical aesthetics of the text are significant.
That might be too low(?) a resolution. It would be learning encodings instead of features of the thing that is being encoded. Like training it on terabytes of zip files and expecting it to reproduce the files contained in the archives.
The thing is, even the video data outside of the archive is also encoded. Most likely the compressed video data will look basically random. You can't train on random data; it's just noise. It would make more sense to train on sequences of RGBA pixels.
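A quick way to see the "compressed data looks like noise" point, using only the standard library: compare the byte-value entropy of some repetitive fake text before and after compression.

    import math
    import random
    import zlib
    from collections import Counter

    def byte_entropy(data):
        # Shannon entropy of the byte-value distribution, in bits per byte (max 8).
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Fake "plain text": random sentences built from a tiny vocabulary.
    random.seed(0)
    words = ["the", "cat", "sat", "on", "a", "mat", "and", "watched", "birds"]
    raw = " ".join(random.choices(words, k=50_000)).encode()
    compressed = zlib.compress(raw)

    print(f"raw text:   {byte_entropy(raw):.2f} bits/byte (repetitive, predictable)")
    print(f"compressed: {byte_entropy(compressed):.2f} bits/byte (close to 8: looks like noise)")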
It's a seductive thought to be able to just throw raw bits at a model, regardless of what those bits represent, and have it just magically attain LLM qualities in reproducing the data you would want it to.
Something to think about: GPT-3/ChatGPT tokenize at the byte level. If they tokenized at the bit level, the model would similarly learn the UTF-8 encoding over time. Unicode characters that require more than one byte to represent, such as emojis, are not learned as single tokens, but the model can still reproduce them.
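You can see the byte-level view directly in Python; an emoji is one character to us, but several bytes (and typically several tokens) to a byte-level model:

    text = "hi 😀"
    data = text.encode("utf-8")
    print(list(data))                                      # [104, 105, 32, 240, 159, 152, 128]
    print(len(text), "characters ->", len(data), "bytes")  # 4 characters -> 7 bytes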
There are a couple of massive intuition leaps here (around tokens, and the ease with which predicting one modality extends to another), but if you're interested in diving into the field at the place where they're asking questions like this, you could start by looking at the transition from BPE to the tokenizer we have today on the tokenization front, and PerceiverIO on the multimodal generalization front.
So you are proposing a massive video model, along the lines of GPT-3? The architecture is simple, but making it train correctly and efficiently is really hard, especially for video.
Not quite. I meant something that models pure binary sequences, not higher level tokens. That way, it could learn from any source that can be represented as binary data. Could be video, text, audio, or all three at once.
It wouldn't be "video model", it would be an "anything that can be expressed in binary" model.
> Perceiver: General Perception with Iterative Attention
Biological systems perceive the world by simultaneously processing high dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. Perceiver is a deep learning model that can process multiple modalities, such as images, point clouds, audio, and video, simultaneously. It is based on the transformer architecture and uses an asymmetric attention mechanism to distill a large number of inputs into a smaller latent bottleneck. This allows it to scale to handle very large inputs and outperform specialized models on classification tasks across various modalities.
Each frame of the video would have to be divided into a long sequence of patches; at least that's how transformer-based image models work. Then you have to account for audio data too, in the same way. It just blows up the compute required.
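Rough numbers for how quickly the sequence length grows, assuming ViT-style 16x16 pixel patches (the resolution, frame rate, and audio token rate here are illustrative assumptions):

    # How many "tokens" a transformer would see for a short clip vs. plain text.
    patch = 16                                        # assumed 16x16 pixel patches
    w, h, fps, seconds = 224, 224, 30, 10
    patches_per_frame = (w // patch) * (h // patch)   # 14 * 14 = 196
    video_tokens = patches_per_frame * fps * seconds  # 58,800 for 10 seconds
    audio_tokens = 50 * seconds                       # assumed ~50 audio tokens/second
    text_context = 2048                               # a typical GPT-3 context length

    print("10 s clip:", video_tokens + audio_tokens, "tokens")
    print("GPT-3 text context:", text_context, "tokens")

And since self-attention cost grows quadratically with sequence length, the gap in compute is far worse than the gap in token counts.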
Related: I wrote "language games" for playing word games with word vectors. I've thought about remaking this beyond my original weekend project and including the latest language models.
https://github.com/Hellisotherpeople/Language-games
- Paper contributes to debate about abilities of large language models like GPT-3
- Evaluates how well GPT performs on the Turing Test
- Examines limits of such models, including tendency to generate falsehoods
- Considers social consequences of problems with truth-telling in these models
- Proposes formalization of "reversible questions" as a probabilistic measure
- Argues against claims that GPT-3 lacks semantic ability
- Offers theory on limits of large language models based on compression, priming, distributional semantics, and semantic webs
- Suggests that GPT and similar models prioritize plausibility over truth in order to maximize their objective function
- Warns that widespread adoption of language generators as writing tools could result in permanent pollution of informational ecosystem with plausible but untrue texts.
I have reservations about several aspects of this article, but what sits least well with me are the substantial conclusions regarding AI compression loss. In short, I disagree.
The "Hill" example in the text is easily understandable via the authors' own presented distinction between "text-based, word-placement, associative semantics" and "semantics by definition". We obviously use semantics in the latter sense, hence the word "definition"; the AI doesn't.
Semantic word relationships identified by GPT-3 are based on the frequency of words prior to and following a position in a sentence/text, as presented to the AI in the programmed/learned dataset. An easier example is when information is known to be untrue. If I include 1,000 written examples of the words "Jack and Jill ran down the volcano", then my AI will answer prompts to finish the nursery rhyme incorrectly. How many instances of serving users the wrong answer, or how many analyzed writings with the correct "ran down the HILL" text, would it take before my 1,000 false volcano statements cease to be the most probable and likely accepted answer?
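In count-based terms, the volcano example is just an argmax over observed continuations; a toy sketch (the counts are made up):

    from collections import Counter

    # Made-up counts of what followed "Jack and Jill ran down the" in a training set.
    continuations = Counter({"volcano": 1000, "hill": 120, "street": 8})

    def most_probable_next_word(counts):
        word, n = counts.most_common(1)[0]
        return word, n / sum(counts.values())

    word, p = most_probable_next_word(continuations)
    print(f"next word: {word!r} with probability {p:.2f}")  # 'volcano', ~0.89

Real models smooth this over learned representations rather than raw counts, but the "most frequent continuation wins" intuition is the same.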
So, like the article's example, if asked "Where was John Smith born?" the AI sees that it has to answer definitively because it's been asked a question, so it's going to make a statement it concludes to be the most probable acceptable answer to the prompt. It doesn't see the prompt as a question; the words are not defined as ideas in themselves, nor does the sum of the words present an idea. Word definitions are not really part of the answer process. The AI checks its dataset and knows all the related word examples it has previously identified through its handy token system; that self-controlled tokenization of memory for storage and retrieval further removes this from a human, brain-like function/process we can empathize with.
Anyway, the AI knows that most statements following words arranged like the question in the prompt include words/textual identifiers not used similarly in other parts of texts: names! Imagine how it figured out names, without understanding definitions conceptually, using frequency of word composition and grammar structure alone. Even though it "knows" the definition of the word "name", that definition is just more words; it has no meaning without context. Prompts provide context, not definitions.
There are three names in the example prompt question: the first name, the last name, and the name/word required as an answer, the location. The AI sees none of these words as a person's name or the name of a place; it sees our question as "words that require my reply to be a definitive statement, primary name-word, secondary name-word, plus 'born'". It knows a different type of name-word (a place) is the most plausible next word in the sequence because it has a whole token dedicated just to birthdates, with limitless examples. Upon searching its dataset for John Smith, it identifies "Hill" as the name-word most often associated with the words "where", "was", "john", "smith", "born" used consecutively. The incorrect city, Hill, makes sense given that his academic career obviously generated more digital information than his birth announcement or obituary in the hometown paper.
Regarding the wrong date: the AI was never actually answering a question and never made any statement with intent to be truthful. The incorrect birthdate is simply the most probable date given the incorrect "John Smith born in Hill" statement. It couldn't present the correct date following the word "Hill" because no such examples existed with a higher probability of being acceptable than the incorrect date given. In fact, given the incorrect semantic links made early in the answer, an incorrect date was most probable.
None of that is compression loss. It's just an AI being an AI, doing exactly what it does. I think it's obvious, based solely on what the authors presented themselves, that it is in fact recalling everything, and only arriving at a failed answer due to differing expectations of what the answer was. The AI delivered the most probable reply to the prompt given the contextual data available to it, the same way it delivers answers we expect and that are factually correct. It didn't draw incorrect conclusions because it chunked everything it learned up and consequently "lost" some of its "memories" in the process.
Programming an AI with facts, or only factual information, doesn't solve the problem at all. Operating with only factual data would help it regurgitate a "born in this town on this day" type of answer more correctly, but only because the token words identified as correlated in a factual text do in fact have an actual correlation. That only increases the probability of the AI arriving at a "correct" reply/answer while still using the flawed "logic" that allowed these errors to occur. An AI that speaks only true statements will still have no actual concept of truth.
“Prediction leads to compression, compression leads to generalization, generalization leads to computer intelligence.” - quote from the article.
I know that when people do this, memory chunking, we do lose stuff; why do we assume that is true for an AI as well? What exactly is it compressing? Our memories are filled with lots and lots of data beyond the reason a memory is a memory. The background clutter, the noise of a crowd, cars driving by, or what lunch was that day are not necessary to recall the memory of your first kiss, for example (unless you were in a crowded cafeteria at a racetrack, in which case that might be all you recall). A kiss, a loud crowd, race cars: from an entire day of activity, those highlights will be all that remain in time. We need to do that, and even with that feature we still forget important things.
What background noise is an AI having to "chunk" away? Are the parameters for it too broadly set? Narrow them. If it sees too much, we tell it what not to see. If it lacks the capacity to effectively store and utilize the information it encounters as it exists, then we have failed to create an effective AI.
If an AI "reads" a 500-page paper and tokenizes the data, what makes you so sure it cannot recall all 500 pages exactly from those tokens alone?
AI compression loss, in a tokenized type system, would have to derive from the further compression of the tokens themselves or failure with the token system.
Just my quick 1,000+ words
Yeah... sry for the book.
Tl;dr -
I find the idea of AI being wrong due to “compression loss” to be a silly concept.
We should avoid humanizing AI and AI learning – all similarity lies on the surface.
Thanks for reading my rant – have a great day! - Jakksen
> GPT can competently engage in various semantic tasks. The real reason GPT’s answers seem senseless being that truth-telling is not amongst them
GPT can (only) mimic speech it is trained on. It can sound or read like real world human speakers it is mimicking. But it can not REASON about whether what it is saying is "true" or logically consistent.
It can not reason. Intelligence requires the ability to reason, logically, right? Therefore I posit GPT is not intelligent. Therefore it can not be AI.
I haven't used GPT but I wonder what happens if you ask it to explain its reasoning to you? What happens if you ask it whether it thinks what it says is true and why it thinks so?
But those answers are not reasoning, just the probabilistic luck of the draw.
You can indeed adjust the priors by rewriting your prompt, for instance adding context that the student who said the dog ate her homework did produce a chewed-up paper, but then it's you doing the reasoning and rewriting the prompt.
> when it seems to gain the ability
"Seems" is correct. It's a parlor trick, like horoscopes and cold reading, with intelligence in the interlocutor.
There’s no sensible reason in this life to read a lesswrong post, c‘mon. That’s not a reasoning system, it’s an online religion based on worshipping Bayes’ theorem.
The “probabilistic” is part of the fixed-function evaluation of the model though. That’s something we humans added; if it makes it not work, just don’t do that.
Nice link, I wasn't aware of Bayesianism. As to why people worship neural-network "AI", I think they like it because they are lonely. It gives them hope that there could be a machine they could talk to, and the machine would understand and offer comfort and advice. I think it will, for many people, if they just believe in it. A bit like a placebo, which also works. Cargo cult, religion. "AI" will be our savior, they think.
But it does not seem to know when it is right or not, does it? It doesn't seem to be conscious of its own reasoning, or is it? And if you ask it about its reasoning, it could give an answer that might or might not be true, right? So what confidence can we have that it is actually doing reasoning in one form or another?
When we see a movie on the screen we see what looks like people speaking and doing things but really it's just a recording. The images on the screen or celluloid possess no intelligence even though on the screen they seem to behave intelligently.
The difference between movies and the language models is that the so called AI-bot is like a movie which has a big tree of possible plot-choices it can take depending on how the audience cheers or boos. That doesn't mean there is any process we could call "reasoning" behind the bot's answers. You know like logical deduction.
A person can explain how they reasoned their way to some conclusion, and so can SYMBOLIC AI. But these trained language models just mimic the observed speech, not the reasoning processes that go on inside the brains of the people who are speaking. If they were, we should be able to produce a trace of such reasoning, and the language models should also be able to describe it to us, like humans can.
You can ask it questions no one has ever asked before, that it can't possibly have memorized the answers to, and simply say "explain the answer step by step". It's not guaranteed that it will answer correctly, but it often does answer correctly and with a chain of reasons.
The cases where it answers incorrectly are sometimes reasoning failures, sometimes because it doesn't know its sources (because the model training doesn't "cite its sources" when dumping knowledge in there), and sometimes because of fixed-function parts of the model (like how it doesn't see individual numbers/letters, but rather tokens using compressed "byte pair encoding".)
In the joke explainer example posted here a couple of weeks back, the "chain of reasons" for why each joke was funny was dead wrong for that joke, though it would probably have been right for most jokes. It has no reasons, only likelihoods.
Your belief that it has a chain of reasons is usually confirmed because the dumb, probability-selected list is, by design, probably plausible.
> It's not guaranteed that it will answer correctly,
Why? Because its reasoning-algorithm is not correct, because it has none. All it has is statistics, likelihoods, which makes it look like as if there was some reasoning behind it.
Statistics are the only thing anything has. You don’t have an algorithm either.
Or rather, you (or a CPU or an ML model) have a “statistical” base layer and then other abstractions on top of that which are capable of being (relatively) deterministic on top of the unreliable execution machinery.
> Large language models might be excellent candidates for evolutionary methods and RL. They need to learn from solving language problems on a massive scale. But problem solving could be the medicine that cures GPT-3's fuzziness, a bit of symbolic exactness injected into the connectionist system.
For example: "Evolution through Large Models" https://arxiv.org/abs/2206.08896
They need learning from validation to complement learning from imitation.