Agree to disagree. I think you are opining about things you lack fundamental knowledge of.
> The structure of text is not the structure of the world. This thesis is mad. It's a scientific thesis. It is trivial to test it. It is trivial to wholly discredit it. It's pseudoscience.
It's unclear what you even mean by that. Are the electrical impulses coming to our brain the "structure of the world"?
The structure of having X apples in Y buckets is the same as the structure in the expression "X * Y", as long as the expression exists in a context that can parse it using the rules of arithmetic, such as a human, or a calculator.
These language models lack context, not just for arithmetic, but for everything. They can't parse "X * Y" for arbitrary X and Y; they've just associated the expression with the right answer for so many values of X and Y that we get fooled into thinking they know the rules.
We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.
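To make the distinction concrete, here's a toy sketch (the handful of memorized pairs below is invented for the illustration): an association table built from seen examples versus a context that actually parses the expression and applies the rule.

```python
# Toy contrast between memorized associations and an actual rule.
# The "training set" below is a made-up stand-in for seen examples.
memorized = {("3", "4"): "12", ("6", "7"): "42"}  # expression -> answer pairs

def answer_by_association(x, y):
    return memorized.get((x, y))  # right only for values it has already seen

def answer_by_rule(x, y):
    return str(int(x) * int(y))   # the context that can parse "X * Y"

print(answer_by_association("3", "4"))    # "12": looks like it knows multiplication
print(answer_by_association("17", "23"))  # None: the association runs out
print(answer_by_rule("17", "23"))         # "391": the rule generalizes
```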
It would be trivial for a network of this size to code general rules for multiplication.
At a certain point, when you have enough data, finding the actual rule becomes an easier solution than memorizing each data point. This is the key insight of deep learning.
Really? Better inform all the researchers working on this that they're wasting their time then: https://arxiv.org/abs/2001.05016
More fundamentally, any finite neural net is either constant or linear outside the training sample, depending on the activation function. Unless you design special neurons like in the paper above, which solve this specific problem for arithmetic, but not the general problem of extrapolation.
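If you want to see that concretely, here is a minimal sketch using an untrained, random one-hidden-layer ReLU net in NumPy (the sizes and seed are arbitrary): far past the last ReLU breakpoint every unit is either fully on or fully off, so the output is exactly linear in the input, regardless of what the net fits inside the training range. With saturating activations like tanh you get the constant case instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained one-hidden-layer ReLU net with random weights (sizes are arbitrary).
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def net(x):
    h = np.maximum(0.0, W1 @ np.array([x]) + b1)  # ReLU hidden layer
    return (W2 @ h + b2).item()

# Every ReLU kink sits at x = -b1[i] / W1[i, 0], all within a few units of 0.
# Far beyond the last kink the net is exactly linear in x, so its second
# difference vanishes:
xs = [1000.0, 1001.0, 1002.0]
ys = [net(x) for x in xs]
print(ys[2] - 2 * ys[1] + ys[0])  # ~0.0: linear extrapolation, whatever was fit inside
```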
> any finite neural net is either constant or linear outside the training sample
Hence why the structure of our bodies has to include the capacity for imagination. Our brain structure does not record everything that has happened. It permits us to imagine an infinite number of things which might happen.
We do not come to understand the world by having a brain structure isomorphic to world structure -- this is nonsense for, at least, the above reason. But also, there really isn't anything like "world structure" to be isomorphic to. I.e., brains aren't HDDs.
They are, at least, simulators. I don't think we'll find anything in the brain like "leaves are green", because that is just a generated public representation of a latent-simulating-thought. There isn't much to be learned about the world from these; they only make sense to us.
That all the text of human history has associations between words is the statistical coincidence that modern NLP uses for its smoke-and-mirrors. As a theory of language it's madness.
Well sure, but neural nets are still universal approximators. Any CPU is a sum of piecewise linear functions. I don't see where this meaningfully limits the capabilities of an AI, since once we're multilayer there's no 1:1 relation between training samples and piece placement in the output.
I just don't see how that's relevant. Nobody uses one-hidden-layer networks anymore. Whatever GPT is doing, it has nothing to do with approximating a collection of samples by assembling piecewise functions, except in the way that Microsoft Word is based on the Transistor.
Should math about a vaguely related topic convince me about this? Multilevel ANNs act differently than one-level ANNs. Transformers simply don't have anything to do with the model of approximating functions by assembling piecewise functions. This is akin to arguing that computers can't copy files because the disjunctive normal form sometimes needs exponentially many terms on bit inputs, so it obviously cannot scale to large data sets. Yes, that is true of the DNF, but copying files on a computer simply does not use boolean operations in a way that would run into that limitation.
The way that Transformers learn has more to do with their multilayering than with the transformation across any one layer. Universal approximation only describes the things the network learns across any pair of layers, but the input and output features that it learns about in the middle are only tangentially related to the training samples. You cannot predict the capabilities of a deep neural network by considering the limitations of a one-layer learner.
>We get fooled into thinking they've learned the structure of the world. But they've only learned the structure of text.
To what degree does the structure of text correspond to the structure of the world, in the limit of a maximally descriptive text corpus? Nearly complete, if not totally complete, as far as I can tell. What is left out? The subjective experience of being embodied in the world. But this subjective experience is orthogonal to the structure of the world, and so this limitation does not prevent an understanding of that structure.
The point is that not only is it impossible to infer the structure of the world from text, but deep learning is also incapable of learning about or even representing the world.
The reason language makes sense to us is that it triggers the right representations. It does not make sense intrinsically, it's just a sequence of symbols.
Learning about the world requires at least causal inference, modular and compact representations such as programming languages, and much smarter learning algorithms than random search or gradient descent.
I don't know why you think this. There is much structural regularity in a large text corpus that is descriptive of relationships in the world. Eventually the best way to predict this regularity is just to land in a portion of parameter space that encodes the structure. But again, in the limit of a maximally descriptive text corpus, the best way to model this structure is just to encode the structure of the world. You have given no reason to think this is inherently impossible.
>There is much structural regularity in a large text corpus that is descriptive of relationships in the world.
Sure, there is a lot. But let's say we want to learn what apples are. So we look at occurrences of "apple" in the text corpus, and learn that apples can be eaten, they can be sweet, sometimes they are sour, red, sometimes green, and so on.
Can apples spontaneously change size? Hmm, no idea, no mention of that in the text. Can they be used as a knife? Dunno. If I had an apple in New York 4 minutes ago, can someone else be eating the same apple in Hong Kong now? Dunno. Did apples exist in France two million years ago? Dunno. Can you drive to Saturn in an apple? Dunno.
In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships. If there were, the model could figure out the answers to the questions above by inheritance.
Maybe these particular questions happen to be answered correctly by PaLM. Maybe not, but the next LLM will include this comment in the training corpus.
But the reason GPT-3 and other models tend to make no sense is that their output is not constrained by reality. The text in the training corpus tends to conform to reality, but when you prompt the model with questions that nobody would ever write text about, the illusion falls away:
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.
Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.
Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.
Q: Do you understand these questions?
A: I understand these questions.
>In short, there's no actual model of what an apple is, as an object in space, connected to other objects by various relationships.
I don't know why you think language models are fundamentally unable to deduce the kinds of knowledge you mention. Much knowledge isn't explicitly stated, but is implicit and can be deduced from a collection of explicit facts. For example, apples are food, food is physical matter, physical matter is fixed in size, cannot be in two places at once, maintains its current momentum unless acted on by a force, etc. Categorization and deducing properties from an object's category is in the parameter space of language models. There's no reason to think that a sufficiently large model will not land on these parameters.
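Not a claim about what an LLM does internally, but just to make the deduction step concrete, here's a toy sketch where the categories and properties are invented for the example: explicit "is-a" facts plus properties attached to a category are enough to answer questions that were never stated about apples directly.

```python
# Invented toy knowledge base: explicit "is-a" links and category-level properties.
ISA = {"apple": "food", "food": "physical matter"}
PROPS = {
    "physical matter": {"fixed in size": True, "in two places at once": False},
}

def deduce(thing, prop):
    """Walk up the is-a chain until some category states the property."""
    while thing is not None:
        if prop in PROPS.get(thing, {}):
            return PROPS[thing][prop]
        thing = ISA.get(thing)
    return None  # not deducible from this tiny knowledge base

print(deduce("apple", "in two places at once"))  # False, inherited via food -> physical matter
print(deduce("apple", "fixed in size"))          # True, never stated about apples directly
```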
>But the reason GPT-3 and other models tend to make no sense is because their output is not constrained by reality.
The issue isn't what GPT-3 can or cannot do, it's about what autoregressive language models as a class are capable of. Yes, there are massive holes in GPT-3's ability to maintain coherency across wide ranges of contexts. But GPT-3's limits do not imply a limit to autoregressive language models more generally.
The demonstration is irrelevant. The issue isn't what GPT-3 can or cannot do, but what this class of models can do.
Reduce knowledge to particular kinds of information. Gradient descent discovers information by finding parameters that correspond to the test criteria. Given a large enough data set that is sufficiently descriptive of the world, the "shape" of the world described by the data admits better and worse structures for predicting the data. The organization and association of information that we call knowledge is part of the parameter space of LLMs. There is no reason to think such a learning process cannot find this parameter space.
It doesn't. It's pattern matching, and you're seeing cherry-picked examples. The pattern matching is enough to give the illusion of understanding. There are plenty of articles where more thorough testing reveals the difference. Here are two:
https://medium.com/@melaniemitchell.me/can-gpt-3-make-analog...
But you could also just try one of these models, and see for yourself. It's not exactly subtle.
GPT-3 was specifically worse at jokes, which is why PaLM being good at this so impresses me. At any rate, I don't care if it only works one in ten times. To me, this is equivalent to complaining that the dog has bad marks in high school. (PaLM could probably explain that one to you: "The speaker is complaining that the dog is only getting C's. For a human, a C is quite a bad mark. However, getting even a C is normally impossible for a dog.")
"It's pattern matching" just sounds like an excuse for why it working "doesn't really count". At this point, you are asking me to disbelieve plain evidence. I have played with these models, people I know have played with these models, I have some impression of what they're capable of. I'm not disagreeing it's "just pattern matching", whatever that means, I am asserting that "pattern matching" is Turing-complete, or rather, cognition-complete, so this is just not a relevant argument to me.
If you threw a thousand tries at a Markov chain, to use the classic "pure pattern matcher", it could not do any fraction of what this model does, ever, at all. You would have to throw enough tries at it that it tried every number that could possibly come next, to get a hit. So one in ten is actually really good. (If that's the rate, we have zero idea how cherry-picked their results actually are.)
And the errors that GPT makes tend to be off-by-one errors, human errors, misunderstandings, confusions. It loses the plot. But a Markov chain never even has the plot for an instant.
GPT pattern-matches at an abstract, conceptual level. If you don't understand why that is a huge deal, I can't help you.
It's a pretty big deal, and there's a big difference between a Markov chain and a deep language model - the Markov chain will quickly converge, while the language model can scale with the data.
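For reference, here is the "classic pure pattern matcher" in its entirety, a word-bigram Markov chain over a made-up toy corpus; it can only ever emit transitions it has literally seen, which is the gap being pointed at.

```python
import random
from collections import defaultdict

# Word-bigram Markov chain over a made-up toy corpus.
corpus = "the dog chased the cat and the cat chased the dog".split()
table = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    table[a].append(b)  # record every observed next-word, with multiplicity

random.seed(0)
word, out = "the", ["the"]
for _ in range(8):
    word = random.choice(table[word])  # can only follow bigrams it has seen
    out.append(word)
print(" ".join(out))
```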
But the way these models are talked about is misleading. They don't "answer questions", "translate", "explain jokes", or anything of that sort. They predict missing words. Since the network is so large, and the dataset has so many examples, it can scale up the method of
1) Find a part of the network which encodes training data that is most similar to the prompt
2) Put the words from the prompt in place of the corresponding words in the encoding of the training data
i.e. pattern matching. So if it has seen a similar question to the one given in the prompt (and given that it's trained on most of the internet, it will find thousands of uncannily similar questions), it will produce a convincing answer.
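To make that caricature concrete, here's a deliberately crude sketch; the stored Q/A pairs and the string surgery are invented for illustration, and the real model does this in a learned, distributed way rather than this literally.

```python
# Crude two-step caricature: retrieve the most similar memorized question,
# then splice the prompt's subject into the memorized answer.
MEMORY = [
    ("how many eyes does a cat have", "A cat has two eyes."),
    ("how many legs does a spider have", "A spider has eight legs."),
]

def overlap(a, b):
    return len(set(a.split()) & set(b.split()))

def answer(prompt):
    p = prompt.lower().strip("?")
    # Step 1: find the memorized question most similar to the prompt.
    question, reply = max(MEMORY, key=lambda qa: overlap(qa[0], p))
    # Step 2: put the prompt's words in place of the corresponding words.
    old_subject = question.split(" does ")[1].rsplit(" have", 1)[0]  # "a cat"
    new_subject = p.split(" does ")[1].rsplit(" have", 1)[0]         # "my foot"
    return reply.replace(old_subject.capitalize(), new_subject.capitalize())

print(answer("How many eyes does my foot have?"))  # "My foot has two eyes.", as in the transcript above
```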
How is that different from a human answering questions? A human uses pattern matching as part of the process, sure. But they also use, well, all the other abilities that together make up intelligence. They connect the meaningless symbols of the sentence to the mental representations that model the world - the ones pertaining to whatever the question is about.
If I ask a librarian "What is the path integral formulation of quantum mechanics?", and they come back with a textbook and proceed to read the answer from page 345, my reaction is not "Wow, you must be a genius physicist!", it's "Wow, you sure know where to find the right book for any question!". In the same way, I'm impressed with GPT for being a nifty search engine, but then again, Google search does a pretty good job of that already.
Understanding of what? What the joke is about? Then no, it has no idea what any of it means. The syntactic structure of jokes? Sure. Feed it 10 thousand jokes that are based on a word found in two otherwise disjoint clusters (pod of whales, pod of TPUs), with a subsequent explanation. It's fair to say it understands that joke format.
If you somehow manage to invent a kind of joke never before seen in the vast training corpus, that alone would be impressive. If PaLM can then explain that joke, I will change my mind about language models, and then probably join the "NNs are magic you guys" crowd, because it wouldn't make any sense.
Good point, coming up with a novel joke is no joke. There's a genuine problem where GPT is to a first approximation going to have seen everything we'll think of to test it, in some form or other.
Of course, if we can't come up with something sufficiently novel to challenge it with, that also says something about the expected difficulty of its deployment. :-P
I guess once we find a more sample-efficient way to train transformers, it'll become easier to create a dataset where some entire genre of joke will be excluded.