GPT-3 (gwern.net)
291 points by cocoflunchy 10 months ago | 200 comments

Personally, as a ML researcher, I find GPT-3 very unsatisfying. There aren't any novel architectural details, it doesn't improve our "fundamental" understanding of the field, and it requires the type of computation I have no chance of getting.

As a fan of the field, however, it is undoubtedly very cool.

I still think there's tasks that GPT-3 has no chance of tackling (say, generating code to solve a novel programming task), but the bitter lesson is a bitter one indeed...

It's kind of LHC-like, isn't it? Same physics "but larger".. "will it work if we scale it up? Apparently: yes."

I think the big point, and what scares me a bit, is that we have yet to discover any sort of fundamental conceptual limit to the Transformer architecture.

Only very remotely "kind of". The LHC isn't a mindless "will it work if we scale it up?" type of experiment at all. The idea behind it is very simple: there are fundamental properties we know we are after, we have some predictions about how particles behave, and we know we want to smash the particles together hard enough to verify some of these predictions. We know how to accelerate charged particles, so here we go, now we need a powerful particle accelerator. Building or even running the LHC isn't the experiment by itself, it's just an "easy" way to find out how theoretically predicted particle states look in real life (hopefully disproving some possible theories).

Many ML experiments, though, are exactly what you said: we ultimately don't have the slightest idea what we are doing, but maybe if we throw enough compute power into gradient descent over some generic enough network, it will learn to do something interesting. Well, I tell you what, theoretically you need only 2 wide layers of neurons to compute anything there is to compute, there just isn't enough computational power in the entire world to possibly come up with the "correct" neuron weights entirely by chance, so what the research mostly really is about is coming up with various tricks to make do with fewer computational resources, which is essentially making sense of some computational problems and making neural networks "less random", to direct the learning process. It is either about new neural architectures, new ways of posing questions for NN to answer, ways to make more training data out of the same amounts of available external data, or alternatives to the gradient-descent learning approach altogether.

So, to wrap it up. It may be kinda interesting to know that GPT-2 type of network didn't reach its full capacity, and if we scale it up even more, it still learns something new. Inspecting how it behaves also might lead to some insights or new possible applications. But ultimately, if there's no novelty of any kind in an experiment, training an NN nobody can realistically reproduce is a great way to show off (or to achieve some practical goal for the owner of this NN, if it was used to find something practical), but doesn't really contribute much (or anything) to ML research.

"Yes, but."

Two wide layers can compute any computable function, but they do not scale in any manageable fashion with problem complexity. (Because you have to model problems as increasingly large lookup tables, whose size explodes rapidly.) The big thing about GPT is that so far, improvement in GPT size has corresponded to a (I think logarithmic?) improvement in both quality of response and complexity of the problem domain that GPT can model. GPT scales, in exactly the way that a 2-layer network doesn't, and we don't yet have a handle on at what size it stops scaling.
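To put a number on the lookup-table point, here is a rough sketch of my own (not from the thread). It assumes the naive construction in which a boolean function on n input bits is memorized with one table entry (one hidden unit) per possible input pattern, so the worst case needs 2^n entries. The function name is made up for illustration.

```python
def lookup_table_entries(n_bits: int) -> int:
    """Worst-case number of entries needed to tabulate an n-bit boolean function.

    Assumption: one entry (one hidden unit) per possible input pattern.
    """
    return 2 ** n_bits


# Table size doubles with every extra input bit.
for n in (8, 16, 32, 64):
    print(f"{n:>2} input bits -> {lookup_table_entries(n):,} table entries")
```

Every extra input bit doubles the table, which is the sense in which the size "explodes rapidly".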

Uh. No, you must be misreading me. It's just "no". Let me simplify: I'm plainly stating that you are absolutely wrong, and while GPT-3 is a "will it work if we scale it up?" type of experiment, LHC totally isn't and even comparing these 2 things is silly.

Are you sure we know exactly what's going to happen when we scale up LHC experiments? You don't think we can discover something completely new?

I think what OP is saying is that physics experiments are guided by theory, in the sense that their goal is to prove or disprove this or that theoretical claim. It's not to fish for interesting correlations in a set of observations collected at random.

I don't know physics enough to give a sensible example, but in physics, as generally in science, a theory is first formed to explain a set of observations, then the theory is tested against new observations and either discarded, if the new observations do not agree with the theory, or accepted otherwise (where "accepted" doesn't necessarily mean that all debate will cease and everyone will agree the theory is true). So, in short, the point of the LHC is to test the predictions of some theory (I think what it's testing is the standard model but not sure) and its size is a function of how easy or hard it is to test such predictions, not an attempt to "look harder" just in case something comes up. And if "something completely new" does come up, then the cycle starts all over again- with a new theory and new experiments to test its predictions. Just "discovering something completely new", i.e. making a new observation, doesn't really tell us anything until we have an explanation for it, in the form of a scientific theory that can be tested.

In machine learning we do not have any theory to guide our experiments so the done thing is to try things and see what works. So just because LHC and GPT-3 are both "large", doesn't mean they have the same goals- or that the goal of the LHC is just to be large, because large is better.

To clarify "will it work if we scale it?" is not much of a scientific claim. Because it doesn't tell us anything new. We've known that it's possible to improve performance by spending more computing resources for a long time now. If I run bubblesort on a supercomputer for a month and tell people that I have sorted a list of integers larger than anyone has ever done before- will people fall over their chairs in amazement? Of course not. Not in any other field of computer science (except perhaps high performance computing) is scaling resources an acceptable way to claim progress. That it is in machine learning is a result of the fact that we have no guiding theory to drive our experiments. So people just try things to see what works.

Obviously, if someone has more resources than almost everyone else, they can try more things and hope to luck out on more interesting results. That's the reason why it's always Google, OpenAI, Uber, Facebook etc. that are in the news for "new machine learning results". They got more stuff to throw at more walls and more eyes to see what sticks.

I think you might be downplaying the scientific method involved in development of novel ML models and methods. It's not exactly a random walk process. Most of the progress has been the result of people trying to model what's going on in our heads: convnets modeling our vision system, rnns modeling feedback loops, reinforcement learning modeling sparse reward signals, attention based models modeling, well, attention. Network training methods (e.g. SGD) are based on optimization theory. There are plenty of theories trying to explain why or how things work in deep learning. Most of these are probably wrong, but some, sooner or later, will turn out to be right. Not unlike physics which for years had competing theories (e.g. string theory vs quantum gravity, etc).

The original transformer experiment tested the hypothesis: "if we organize layers of attention mechanisms in a certain way and feed them tons of text they will be able to process that text effectively enough to build very rich and consistent language model". GPT experiments test the hypothesis "if we feed the transformer more data its language model might become rich and consistent enough to produce human level results". To me this sounds like a well-defined scientific experiment.

Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k of pi digits after processing n pi digits - without knowing anything about how to compute pi, or what pi is. To me that deserves falling over my chair in amazement.

Specifically about attention:

>> The original transformer experiment tested the hypothesis: "if we organize layers of attention mechanisms in a certain way and feed them tons of text they will be able to process that text effectively enough to build very rich and consistent language model".

I read the paper when it came out and I checked it again now to confirm: there's no such hypothesis in there. In fact "Attention is all you need" is a typical example of a post-hoc paper that describes what a research team did that worked and how well it worked. "We tweaked these knobs and out came STUFF!". Typically for deep learning papers it lacks a theory section, the space of which is instead taken by an "Architecture" section which, well, describes the architecture. There are no theorems, or proofs. There is nothing that connects some kind of theoretical claim to the experiments. The main claim of the paper is "we build this system and it has better performance than previous systems". Like I say in my earlier comment, that's not an interesting scientific claim.

I'm sorry but personally I find that kind of work irritating. "We tried some stuff and got some results". Woo-hoo. But, why did you try that stuff and why did you get those results? Did you try to get the same results without that stuff? Did you try to get some other results with the same stuff? Can you explain what is going on in your system and why it does that when I twist this knob? If I twist that knob, can you tell me what it will do without having to run it first to find out? Typically, 99% of the time, the answer to all this is "no" (i.e. no ablation experiments, no theoretical explanations, etc. etc, no nothing). It's like I say above, just throwing stuff at the wall to see what sticks. And then writing a paper to describe it.

Oh, and calling stuff suggestive names like "attention". If I call it "boredom", am I more or less justified than the authors?

>> Most of the progress has been the result of people trying to model what's going on in our heads: convnets modeling our vision system, rnns modeling feedback loops, reinforcement learning modeling sparse reward signals, attention based models modeling, well, attention.

The problem with all those advances is that they all happened more than 20 years ago. My comment discusses the state of machine learning research right now, which is that there are very few new ideas and the majority of the field doesn't have a clear direction.

Note also that the advances you describe were not "guided by theory". They were inspired by ideas about how the mind works. But, finding ideas to try is not the scientific process I describe above. And just because you're inspired by an idea doesn't mean your work is in any way a proof or disproof of that idea. For example, CNNs were not created in an effort to demonstrate the accuracy of a certain model of the visual cortex. In fact, Yann LeCun is on record saying that deep learning is nothing like the brain:

Yann LeCun: My least favorite description [of deep learning] is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does.


>> Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k of pi digits after processing n pi digits - without knowing anything about how to compute pi, or what pi is.

That's not a good example. GPT-3 can't actually do this. In fact, no technology we know of can do this for a k sufficiently large and with accuracy better than chance. Personally I find GPT-3's text generation very underwhelming and not anywhere near the magickal guessing machine your example seems to describe.

Each attention block in the Transformer models a fully connected graph (with the attention heads being learned edge attributes). A graph is the most general data structure possible, so yeah, I don't think there's really a fundamental limitation to them, just a computational one. Latest papers from ICLR explore how they fully model CNNs and RNNs, for example, and I'm sure papers on their theoretical equivalence with GNNs are coming.
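A minimal sketch of the "fully connected graph" reading, under simplifying assumptions that are mine: one head, no learned projections (queries and keys are just the raw token vectors), and made-up toy numbers. It computes scaled dot-product attention weights and shows that the result is an n x n matrix of strictly positive weights, i.e. a weighted edge from every token to every token, including itself.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights for a single head.

    Entry [i][j] can be read as the weight of the edge from token i to token j.
    """
    d = len(keys[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
        for qi in queries
    ]
    return [softmax(row) for row in scores]


# Three toy token vectors (made-up values); real models apply learned
# projections first, which this sketch leaves out.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and no entry is zero, so nothing in the block itself restricts which tokens can influence which; any sparsity has to be learned or imposed by masking.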

I'm not sure we want to work with "the most general data structure possible". We can use a Hopfield-like network where every neuron is connected to every other neuron and to itself. It probably won't be very useful. NN design has been moving from more general to less general architectures.

It is implied that at least in theory, the promise is to use them with similar total complexity (like number of parameters and amount of required calculation), in which case yes we do want the most general data structure possible. If we can have a more general data structure that provides similar performance characteristics, it is easier to apply, to debug, to understand, and it likely means that we have found something more fundamental about the underlying world in general.

I think you're conflating "generality" of NN architecture and "generality" of the inductive bias the NN models.

> I'm sure papers on their theoretical equivalence with GNNs are coming.

Not exactly a paper, but you mean something like this article: https://graphdeeplearning.github.io/post/transformers-are-gn... ?

Pretty much that, yeah. Thanks for the link. That kind of article sets the intuition, and papers like these give it a better theoretical treatment:

- https://openreview.net/pdf?id=HJlnC1rKPB

- https://openreview.net/pdf?id=ByxRM0Ntvr

About a month ago, OpenAI demoed a language model that does Python code generation, using a natural language prompt. It was trained on GitHub repos.


This is extremely impressive, but 1. It's a cherry picked demo (no paper has come out yet), and 2. This isn't really what I mean by "novel programming task".

I'm thinking more in line with the type of questions you might get in competitive programming.

It is not only unsatisfying to ML researchers, but pretty much everyone. We are mere prediction machines and AI is going to render us dispensable bags of meat using an embarrassingly simple algorithm that you could explain to a bright child within a day.

You don't find Image GPT[1] novel?

I'm not aware of anything else showing how effective the transformer architecture is on image tasks, and I think that is a pretty fundamental breakthrough. I realise it's not GPT-3 (it's based on GPT-2), but this seems more a project resourcing issue than being fundamentally unsatisfying.

[1] https://openai.com/blog/image-gpt/

I wonder what they could do with it if the US decided to do an AI moonshot. That would be the equivalent of spending 500 billion in today’s dollars over 10 years. That’s a lot of compute.

If you do the math on best-case costs for some of Google's AI projects they've been spending tens to hundreds of millions worth of compute on things like training StarCraft bots, so I wouldn't be surprised if the industry as a whole is throwing $50b USD worth of compute at various tasks every year already. If you're a company like Amazon or Google using your own fleet for it you're not actually spending that much money, of course...

> I still think there's tasks that GPT-3 has no chance of tackling (say, generating code to solve a novel programming task)

Umm https://www.youtube.com/watch?v=y5-wzgIySb4

That was a smaller model fine-tuned on Github IIRC.

I don't really find that very impressive at all.

1. Scientifically, synthesizing really simple loop-free programs from high-level specs is not really new. You can synthesize programs at this level of complexity with a few minutes on a single ten year old laptop using 5-10 year old algorithms.

2. Scientifically, it's already known that the architectures used in nlp models are also useful for simple program synthesis tasks.

3. From an engineering perspective, here was a good HN discussion about this a few weeks ago. IMO this comment hits the nail on the head (https://news.ycombinator.com/item?id=23256333):

> This technology converts writing code into bug hunting in pre-written code. Finding bugs in code that you did not write is way harder than writing the code yourself. So if anything, this makes programming harder, not easier, and we will need more programmers, not less.

Rings true for this example. I can write my intended implementation much faster than I can read the generated code, find the bug, and think of a comment that captures the correct spec.

It's interesting, but it's not "generating code to solve a novel programming task".

I agree with you and I doubt we'll get useful AI-generated code before we get strong AI, and by the time we get that we'll have much more interesting applications to showcase I'm sure.

I can definitely see this tech leading to ever more clever linters though. Having an AI reading my code, interpreting the comments, variable and function names and use that to look for bugs in my implementation could be highly valuable, assuming that the number of false positives isn't overwhelming. "Your comment says that you apply the discount only on palindromes but your condition here is wrong and your code ends up doing the opposite".
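A hypothetical example of the kind of comment/code mismatch such a linter would need to catch (all names and the 10% figure are invented for illustration). The comment states the intent, and the condition does the opposite:

```python
def is_palindrome(text: str) -> bool:
    """True if the text reads the same forwards and backwards."""
    return text == text[::-1]


def price_after_discount(price_cents: int, coupon: str) -> int:
    # Apply the 10% discount only when the coupon code is a palindrome.
    if not is_palindrome(coupon):  # deliberate bug: condition is inverted
        return price_cents * 90 // 100
    return price_cents
```

The code is valid and type-checks, so a conventional linter has nothing to flag; spotting the problem requires relating the natural-language comment to the branch condition.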

We already have AI-generated code. It's just that compilers work well enough to actually use, we understand their limitations, and so we stopped calling microcode compilers "automatic programmers" back in the 80s.

I think we will have useful source code synthesizers for high-level languages at some point in the next few years. They will probably look more like alphazero than gpt-3 though.

I'm not sure that they will be commercially successful. The way you automate away programmers is Wix, not Ruby on Rails program synthesizers.

The next step could be a WASM synthesizer, though.

Yeah. What I want is the inverse. Get me some AI that predicts which lines of code I write that are likely to cause bugs. Kinda like a linter meets a fuzzer.

This is a motte-and-bailey argument if I've ever seen one. It's true that it's simple code. It's true that it's not going to replace programmers anytime soon. It's true that this is not novel computer science. But this is clearly a novel program and it's also clearly not something that other methods could have done.

> But this is clearly a novel program

"not novel code" here was referring GP's "novel programming task", not the synthesis method. I think we're probably using different definitions of "task". Where you mean it in a very particular sense (this exact piece of code) and I mean it in a more "writing if-else blocks inside a single procedure with no loops and no recursion using functions that are in-scope" sense.

The proper way to determine if there's anything interesting here would be to run gpt-3 on some existing program synthesis benchmarks. Literally any program synthesizer can look super impressive if you just show one working example in a yt video. My suspicion is that gpt-3 isn't going to do particularly well on those benchmarks at least out of the box, and that getting it to work as well as sota would require a bunch of non-trivial engineering work.

You have a much rosier view of program synthesis than I do. Could you link a paper that you think is particularly impressive? I know Idris can do trivial inferences interactively, but I don't know anything that can do anything non-trivial that isn't also very slow and very unreliable.

IIUC, the General Program Synthesis Benchmark Suite[1] is still mostly unsolved, including problems like “Given three strings n1, n2, and n3, return true if length(n1) < length(n2) < length(n3), and false otherwise.”

[1] http://cs.hamilton.edu/~thelmuth/Pubs/2015-GECCO-benchmark-s...
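For a sense of scale, here is the quoted problem written by hand in Python. This is my own sketch with a made-up function name; as I understand it, the benchmark asks a synthesizer to discover a program like this from input/output examples rather than from the English description.

```python
def strictly_increasing_lengths(n1: str, n2: str, n3: str) -> bool:
    """True if len(n1) < len(n2) < len(n3), and False otherwise."""
    return len(n1) < len(n2) < len(n3)


print(strictly_increasing_lengths("a", "bb", "ccc"))
print(strictly_increasing_lengths("a", "bb", "bb"))
```

The target program is a single comparison, which is what makes "still mostly unsolved" a striking claim about the state of general program synthesis.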

Oh, no, sorry, I don't.

My point wasn't that current program synthesis is particularly great, although I do think modern program synthesis tools can probably beat gpt-3 on lots of problems (and allow that the other direction is probably true too...)

My point was that I'm skeptical that GPT-3 would do particularly well on those benchmarks without lots of additional effort. And then, since you can build pretty much anything anyway with enough blood and sweat, the actual question is: would the same amount of effort poured into an alternative approach generate the same/better results but in a way that's far easier to interpret/extend?

It could work. But the yt video alone is more "huh, interesting" than "wow, impressive". If that makes sense.

Well the key difference is that you don't have to think much to get a code-specialized language model, and when you do train one it's much more general (eg. inferring constraints from text, using user-provided types, correctly naming variables, less prone to exponential complexity as sample length grows, etc.). And then the model just gets better over time as AI improves, and all you have to provide is a comparatively cheap bit of compute.

I got the impression from you saying “You can synthesize programs at this level of complexity with a few minutes on a single ten year old laptop using 5-10 year old algorithms.” that you thought this was generally solved at this level of complexity, rather than merely true for an easier example here and there.

Maybe it would be helpful if you gave an example of the simplest python function it won't be able to synthesize, and if/when they release the code GPT into the API we can test your prediction.

Relating to 1), check out miniKanren if you're interested. It's a wonderful piece of software: http://io.livecode.ch/learn/gregr/icfp2017-artifact-auas7pp

That's not a "novel" task though. It's an extremely common pattern.

>Like the genomics revolution where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoff about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, and shame and humiliation to the critics!

No. Actual geneticists are still pretty skeptical about GWASes because they tell us almost nothing about, well, the genetics behind complex traits. It's all good and well running GWASes for literally anything (see: twitter.com/sbotgwa or that dude who got a pretty good PGS from correlating a country's GDP with the genotypes of Arabidopsis thaliana available in that country) but that's virtually useless for serious research or if you want to know how genes work.

Actually figuring out to what extent a trait is genetically determined usually involves much more complex methods (e.g. mendelian randomization) and knock-out experiments on animal models, which is all terribly expensive and tedious. But that's how actual genetics works, not waving a magic wand of +15% heritability.

Agreed. To put it simply, if you model a complex system with another complex system, of which you have a similar level of understanding, then even if the model is very faithful, you gain very little. Unless the model is transferable, so you can predict, which AFAIK is not the case here.

I think the point is not that GWAS can do everything, but that many people 5-10 years ago thought that they had a fundamental flaw - where was all the heritability? It has turned out so far that the heritability is there and discoverable with big enough sample sizes. But of course, you are right that GWAS don't tell us much about the very large space between strands of DNA and social or behavioural outcomes. They are a tool like others. (Mendelian randomization is hardly a panacea - how credible are those exclusion restrictions, usually?)

I agree, holding up GWAS as some kind of lesson about scientific progress seems pretty silly. I don't think the 'sober experts' deserve shame and humiliation considering that PGSes have had minimal impact on human health.

Unless something has drastically changed in the past 5 odd years, Actual Geneticists (tm) happily use GWAS in many investigations of complex traits. They're often an early step in the overall pipeline in which they trawl for potential candidates for that target of interest, after which they go on to said more complex methods.

Why? You've already said so yourself. Those are expensive and tedious, and searching across the entire pool of human genetics with them is an exercise in futility.

The important part in your post is

>They're often an early step

I think we're in agreement here, I'm just arguing that "woah, look at all those correlations" isn't a breakthrough or 'genomic revolution' in any sense of the word as far as our understanding of human genetics is concerned.

Seemingly, most of the complex traits for which GWA is an effective methodology (a few SNPs of large effect size) have already been discovered. More and more these days, I’m seeing association studies that fail to yield any hits. Whether this is due to sample size, polygenicity, or some other model failure remains to be seen.

Can you post the link about the GDP/A. thaliana correlation? That sounds hilarious, and Google is failing to find it for me.

Throws a Twitter error for me.

GWAS have been shown to be practically meaningless. How should this paragraph be interpreted, that transformers can make sense of DNA patterns?

Who cares how genes work? The important part is the real predictive power of the PGS.

'Predictive' is just a glorified word for 'correlating'. Staring at a bunch of correlations only gets you so far and doing only that certainly isn't science. If you have no idea how genes work you're going to be dismayed when all your fancy correlations don't work anymore, as people gathering genomic data from outside the European biobanks are starting to find out.

Nope. You're wrong. Where do you think your MRs or your sibling comparisons are coming from? You're not getting anywhere with weak instruments from a few significant hits. You're also wrong that the only thing of value is inferring "how genes work", and that is the sort of extremely blinkered mechanism-centric attitude which blinded people to GWASes working, because gosh, it would be awful if polygenicity was true, because how would we build any 'scientific theories' on this? (Cue Turkheimer.) And yes, those PGSes are useful for all sorts of things like selective sweeps, enrichments, and clinical instruments, because of incremental validity. (By the way, what does '15% heritability' refer to? I sure hope that, since you're claiming to be an expert here, you aren't confusing heritability with PGS power, like so many supposed human geneticists insist on doing...)

I didn't say GWASes were useless, just that it's absurd to consider them to be a 'revolution'. The actual revolution would be second- and third-generation sequencing which enabled GWASes and a bunch of much more useful things. GWAS is, in effect, just a bunch of correlations. It's just the very starting point to an actual scientific analysis, because 'you have to start somewhere'. If you don't go beyond and investigate, you're not doing science. Everyone in the genomics community agrees to this, and literally every paper that investigates the causes of genetic diseases goes in the introduction like 'GWASes sure look nice but we still have no idea how things work with them so in this paper I present a method to do...' I notice you failed to address many of the spurious correlations drawn by the GWAS bot or the A. thaliana vs. GDP prediction. That it doesn't raise any red flag to you doesn't speak well as to your ability to approach the field of genomics.

>You're also wrong that the only thing of value is inferring "how genes work"

Yes it is, that's literally what genetics is about. Otherwise you're back to making a bunch of correlations. If you want a deep understanding of disease, design effective drugs, or even do proper gene editing, you have to understand what genes do. It seems ridiculous to have to say it.

>By the way, what does '15% heritability' refer to?

It is what we in the less-rationalist circles refer to as a 'joke'.

I don't know much about ML but I wonder what the need for so much training data (apparently about 500 billion words for GPT-3) means for this approach. Humans achieve their performance levels while only ever observing a tiny fraction of that training data --- at least in word form. I see only two possibilities:

-- Human brains somehow are able to learn from all forms of sensory input and world interaction in ways that pay off in word tests. (Even then, would the information processed by an average human match the GPT-3 training corpus?)

-- The human brain has a very different architecture that lets us learn vastly more efficiently from small amounts of training data.

Is there another possibility I'm missing?

If the latter is true then that gives humans a durable advantage over GPT-3-like approaches.
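A back-of-the-envelope version of the comparison, as a sketch only. The 500 billion figure is the one quoted above; the human-side numbers (20,000 words heard or read per day, sustained for 30 years) are my own round assumptions, not measurements.

```python
GPT3_TRAINING_WORDS = 500e9   # figure quoted in the comment above
WORDS_PER_DAY = 20_000        # assumption: words a person hears or reads daily
YEARS = 30                    # assumption: years of exposure

# Total words a person would encounter under these assumptions.
human_words = WORDS_PER_DAY * 365 * YEARS
ratio = GPT3_TRAINING_WORDS / human_words

print(f"human exposure: ~{human_words:,} words")
print(f"GPT-3 corpus is ~{ratio:,.0f}x larger")
```

Under these assumptions the corpus comes out a couple of thousand times larger than a person's lifetime exposure, so the gap survives even if the per-day figure is off by an order of magnitude.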

> Is there another possibility I'm missing?

Possibility: The human brain and GPT-3 are doing radically different things and aren't even comparable. GPT-3 is merely memorizing enough language to pass a turing test, whereas human brains are actually learning how to use language to communicate with other humans.

Evidence: Have GPT-3 write your work emails for a day, and then live with the consequences of that for the next week. It's going to produce text that makes sense, but only in a very particular way. You will end up with email threads where an outsider says "yeah looks like work emails. I believe this is two humans" And, that's very impressive! But your actual interlocutor who understands the full context of the conversation and relationship will legitimately worry for your mental health and maybe even contact your manager.


1. Any time you invent a test, people will find ways to pass the test with flying colors but completely miss the point of the test. The turing test is no different.

2. Being able to imitate a human well enough in a five minute general english conversation is only so useful. There's a reason we don't pay people living wages to write internet comments. This isn't to say that GPT-3 is useless, though. There is certainly demand for very specialized five minute conversers that come at zero marginal cost. I'm worried.

3. We still have no clue how to even begin to approach AGI.

I think that you make an interesting point - that these transformer models can produce reasonable text only freed from context, and generally the useful thing is to produce reasonable text taking context into account.

But how far are we really from going beyond the five minute conversation you mention? What if you combine a transformer text model with a chess engine to produce a chess tutor? It doesn't seem like we are too far from a computer program that could teach chess from first principles (i.e. learn chess itself, and teach it in a meaningful way).

Maybe in combination with economic models, it could teach economics?

What else? Perhaps in combination with wikipedia it could teach subjects such as physics or biology? But it would just be regurgitating in that case, not "understanding" what is in wikipedia. We need an existing model to tie to the text generating capability. So what if you take an alpha-go-zero approach to model generation - instead of a human creating the model, it is developed through adversarial learning? Theoretically then, anything that can be formulated in a contest or game could be "learned" and modeled, and then combined with text generation to be taught. That seems pretty powerful, and not so far out of reach.

(Also, I love the idea of letting GPT reply to work emails for a day. Someone should set up some experiments to actually do that, and find out what happens. I bet that even though it would be a disaster, we would learn a ton.)

Yeah, I think that sort of stuff is exactly the future of computing. Just because GPT-{N+1} isn't an AGI doesn't mean that GPT-{N+1} won't cause an exciting step-change in what's possible with computing.

> Someone should set up some experiments to actually do that, and find out what happens. I bet that even though it would be a disaster, we would learn a ton.

Ha! Agreed! Unfortunately I do too much soft external-facing stuff these days to do this myself :(. Someone who interacts mostly internally with engineers/techy types who might appreciate the experiment should totally do this.

You may find this interesting https://www.reddit.com/r/SubSimulatorGPT2

> Have GPT-3 write your work emails for a day

This almost certainly won't work with GPT-3, but it might with GPT-5, GPT-10, or GPT-20.

It depends how data hungry those models get. It might be the case that to train GPT-10 you actually need more data than there is available in the universe, so you have no chance of training it.

Learning from sensory input and world interaction PLUS language is a huge thing. Like teaching a dog to sit. The dog learns about the world with all sorts of sensory input, it interacts with the world, and it hears you speak a ton. But it'll never learn what "sit" means without an explicit combination of language and real world interaction. We do the same. Feedback loops are important too. Teaching your dog to sit, the words and actions you take depend on what the dog's doing. Teaching a baby to talk, your words and actions depend on what the kid's doing and saying, and maybe even asking and answering questions.

Current language models learn language from how we use it, but only in relation to more language. Hopefully, we can eventually get some good data on how we use language in relation to the real world too. I can't imagine where you'd get a massive dataset like that though, shy of having robots out in the world acting like babies.

> Human brains somehow are able to learn from all forms of sensory input and world interaction in ways that pay off in word tests. (Even then, would the information processed by an average human match the GPT-3 training corpus?)

Yes, absolutely humans learn from cross modalities. I haven't seen much work on attempting this in neural networks, but cross modal prediction is known to work well.

> Even then, would the information processed by an average human match the GPT-3 training corpus?

I'd be surprised if it didn't. The visual cortex processes around 8.75 megabits per second[1]. Assuming eyes are open around 16 hours a day that is 63 GB/day of information just from the eyes.

Assuming 500 billion words in the GPT training set, 5 characters per word on average, and 1-byte characters, that is a 2.5 TB training set, or about 40 days of information for a human.
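Working the figures above through in a few lines makes the comparison easy to re-run with other assumptions. This is only a back-of-the-envelope sketch: every constant is one of the numbers quoted in this comment (8.75 megabits per second, 16 waking hours, 500 billion words, 5 one-byte characters per word), and the variable names are made up for illustration. Under those assumptions the corpus comes to about 2.5 TB, roughly 40 days of visual input.

```python
# Back-of-the-envelope check using only the figures quoted in the comment.
VISUAL_BITS_PER_SECOND = 8.75e6   # 8.75 megabits per second
WAKING_HOURS_PER_DAY = 16         # assumed hours per day with eyes open

# Bits per second -> bytes per waking day.
visual_bytes_per_day = VISUAL_BITS_PER_SECOND * WAKING_HOURS_PER_DAY * 3600 / 8

CORPUS_WORDS = 500e9              # assumed size of the training set in words
CHARS_PER_WORD = 5                # assumed average word length
BYTES_PER_CHAR = 1                # assumed 1-byte characters
corpus_bytes = CORPUS_WORDS * CHARS_PER_WORD * BYTES_PER_CHAR

# How many days of visual input carry as many bytes as the corpus.
days_to_match = corpus_bytes / visual_bytes_per_day

print(f"visual input: {visual_bytes_per_day / 1e9:.0f} GB/day")
print(f"corpus size:  {corpus_bytes / 1e12:.1f} TB")
print(f"days of visual input to match the corpus: {days_to_match:.0f}")
```

Raw byte counts say nothing about how much of that input is useful for learning language, so this only bounds the comparison.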

Now the two aren't directly comparable, but humans gather a lot of information.

[1] https://www.newscientist.com/article/dn9633-calculating-the-...

If this explanation is true, it suggests a very important experiment would be to develop AI models that train on video and (much less than 500 billion words of) text and then tackle these test problems that GPT-3 is evaluated on.


It's not entirely obvious how to do this though. There needs to be a common training representation and that's pretty hard.

In videos, the sequence of frames is important, but each frame is an image. In text the sequence of characters is important.

Maybe something that can accept sequences (of some length) of bytes, where the bytes may be an image or a single character, might work. But a unified representation of images and words for training is probably the first step towards that.

Based on the current state of the art and the research problems that still need solving, it's probably 3 years before we are in a position to contemplate the kind of training suggested here.

Another very important aspect is that humans interact with the world they are observing. It isn’t just passive processing of data. Training a model on video may help over text alone, but the interactivity is still missing.

Probably a combination of factors, not a single one. In addition to the ones you mentioned: GPT is also much smaller than a brain (fewer parameters), so it can't form higher-level concepts that might help it to organize things better. And the human brain has priors about the world baked in through evolution, something that GPT sort of avoids by working on arbitrary token sequences.

I considered the "pretrained via genetics" possibility but I don't see how that could make much difference. It doesn't look like the genome has enough information to encode really large amounts of brain structure details.

It's not really pretraining, more of an architecture optimized for the problem-space the brain is dealing with. There are specialized regions for different tasks while a transformer is uniform. Being uniform makes it easier to scale, but it probably means while being generic it's also kind of inefficient so it needs way more training and parameters than an optimized architecture.

Brains are also quite energy-efficient, which makes parallel search of architectures a lot cheaper. A small rodent having a handful of offspring every year is equivalent to building new supercomputers with specialized hardware beyond current capabilities on a tiny budget and then using that to train a model and seeing whether it's better. Researchers don't have that luxury so they have to compromise by running a very generic model that is flexible but inefficient.

That makes sense but it kind of falls into my "brain has a better architecture" possible explanation. What's interesting is that the thrust of the GPT approaches is "architecture doesn't matter while we can just keep building bigger models".

It would be surprising to me if AGI is achievable via two such different pathways. I don't know this area so I'm willing to be surprised.

> That makes sense but it kind of falls into my "brain has a better architecture" possible explanation. What's interesting is that the thrust of the GPT approaches is "architecture doesn't matter while we can just keep building bigger models".

Those are not mutually exclusive. Architecture doesn't matter asymptotically but it does matter for task-specific performance for any concrete complexity budget. Compare to Big-O behavior of algorithms (asymptotes) vs. doing hardware specific optimizations (cache-friendliness, hand-rolled assembly, outsourcing specific parts to ASICs...).

OA is doing the former, looking at asymptotes. Nature has done that too (mammalian brains scale from rodents to humans) but has also applied all kinds of task-specific optimizations. Compare the cerebral cortex and cerebellum, for example: the latter acts like an application accelerator, but if it doesn't work, software fallbacks are possible, albeit slower.

> It would be surprising to me if AGI is achievable via two such different pathways.

Have you considered cephalopod or bird brains?

There's a lot of stuff, language for example, that's mostly genetic. Humans aren't learning these things from scratch, generally they're just filling in some blanks and helping guide the development of the brain.

see: https://en.wikipedia.org/wiki/Universal_grammar

I generally agree, but I think there is something else going on too. Conceptualization is highly flexible and comes in really early. Little kids ask "why?" a lot. Why do they? There must be some energy minimization reason for it. I don't think we're going to get AGI without it being able to ask some questions and get some answers that fundamentally update its understanding of the world.

There's also humans' ability to probe the real world and understand and create new constraints. Putting a square block into a circle shaped hole teaches you new things.

> -- The human brain has a very different architecture that lets us learn vastly more efficiently from small amounts of training data.

I think this is right, but it's perhaps misleading to identify this as 'efficiency' in a computational sense - we're talking trillions of extra parameters derived in the brain from the tiny fraction of training data.

It is not entirely clear how it would be correct to calculate the amount of data required by a human language center to learn that much. Do you count all the random thoughts in someone's mind? What if those thoughts, and also some subconscious processes, are constantly running in that network?

Input to the human (like audible words said in front of them), and input to the network (whatever data that part of the brain receives from anywhere, both from outside but also from other parts of the brain) are two very distinct things.

> gives humans a durable advantage over GPT-3-like approaches.

I think it is pretty clear to everyone involved that there are, empirically, so far, numerous very important and obvious and deep advantages that humans have over a GPT-3. This doesn't mean that GPT-2/3 aren't a tremendous achievement in AI.

> Humans achieve their performance levels while only ever observing a tiny fraction of that training data

Individual humans only ever observe a tiny fraction of that training data, but our brains are not the products of solely our individual experiences. We come out of the womb with a massive amount of evolutionarily trained, genetically encoded weights in our neural network.

Where are those weights encoded? It doesn't seem like the genome has enough bits to encode a lot of weights.

I agree with other people in this thread that structure/architecture is encoded, but not weights.

> Humans achieve their performance levels while only ever observing a tiny fraction of that training data

This is really the key detail and the hole in Gwern's argument that AGI is around the corner. You can't just compare the result. You also have to look at what it took to train the model and at what the model is actually doing.

If you look at GPT-3's output, it only superficially makes sense. Is there evidence of true understanding or is it just a really really good text generator?

Regardless of AGI though, I do think that models like this will eventually mean the end of social media and perhaps all wide open discourse on the Internet. When this stuff gets easy and cheap enough that spammers and propagandists can use it, it's over. How much money in compute/storage would it take to train GPT-3 to advocate for Donald Trump or Joe Biden all day long, or to shill products, or to just generate superficially comprehensible text to execute a kind of denial of service attack on a community?

I don't buy the "we have GPT-3, therefore we may soon have artificial general intelligence" (AGI) notion.

AGI might happen tomorrow; it might happen in decades, in centuries, or never. GPT-3 is basically a straightforward scaling of GPT-2, but I see no evidence that simply scaling GPT-2 or GPT-3 will lead to AGI. The problem is we don't know what else is needed.

Maybe the end of anonymous ungated social media, yes. We can still talk to humans and sources we know and trust. Probably that's what we need to do already.

Given the idiosyncrasies of GPT-style text generation, the cost of detecting generated text is likely to be orders of magnitude lower than the cost of generating the text. Such detection may also capture low-quality content on the internet and prove a boon for high-quality commentary.

Whether humans have a genetic language facility is not yet known.

It is also kind of a moot point. Whether or not there is a genetic component, it is relevant only as far it is cashed out in physical brain matter and its arrangement.

Just so we're all on the same page, this is a random GPT-3 sample (I clicked "Random" twice, the first one was a short Wikipedia-like article, this was the second):


As far as I'm concerned, I would probably pick this over a random fanfic in a Turing Test.

I was very impressed, but two clicks later landed me here:


This reads exactly like bad plagiarism, with the word count padding repetitive phrases and too-on-the-nose facts. It's not generating, it's memorizing.

> This reads exactly like bad plagiarism

Are you also saying that it is plagiarism, or that it just looks like it? To be honest, considering the huge training sets and knowing nothing of the testing methodology, I have some lurking suspicion that it could be just copy-pasting chunks of text...

On the other hand, if it is just writing "articles that feel a lot like plagiarism", then I suppose it is doing its job properly considering what you find on the internet.

I guess it's both. It reads like plagiarism of bad plagiarism.

Well, to be fair, it's learning from the internet, where there is a lot of low-information content and repetitive web pages. I think also the lack of formatting makes things seem more artificial than they are.

I don't think a film analysis written by a human would consist of five separate introductions.

That's amazing. The only problems I caught are that there's a spurious double quote right at the start, and Murphy seems to have two children, a son and a daughter, but then later refers to "the boys".

To really evaluate these samples, we need a way to search for phrases in the training data, to see how much is just learned and copied. I've tried Google and DDG, but not found anything.

I don't know much about these kinds of neural nets but this is the thing I've wondered about - with billions of parameters and enormous training sets, it seems more than likely that it's squirreled away fair chunks of text inside itself that it can regurgitate.

Edit: I pasted the random sample above into Grammarly's plagiarism detector and it says it "detected plagiarism" but I'm not paying to sign up just to find out what it said (also they make you sign up with your email address before mentioning that it costs money... rude!) Maybe someone with a Grammarly subscription can try it?

Edit 2: Getting off-topic but wow, Grammarly is a nest of dark patterns. Doesn't tell you you need to create an account until you've pasted your text in. Doesn't tell you you need to pay a subscription until you've created an account. Sends you a follow-up email a few minutes later if you don't subscribe. Puts the account deletion link in a different colour and location to the rest of the account settings so it looks like a footer. Swaps the styling for the 'delete account' and 'keep account' links when you do find the delete button to try and steer you away from deleting your account. After that lot I'm never giving them a cent.

I also wonder how much is invented, and how much is simply copied. I saw this one by GPT3


and compare it to e.g.


GPT-3 has quite a lot of correct facts, manages to mix Fouad Belkacem and Sharia4Belgium into it, etc... I suspect this is what happens when there is not enough training data: You get a lookalike, not something new.

To be clear: It's impressive what GPT-3 is capable of, and it's very probable something better is coming in the near future.

(It's also quite amusing how a lot of these terrorists get first killed by the police and then thrown into jail for 20 years. That'll teach their corpses! )

> I also wonder how much is invented, and how much is simply copied

Well, the average human probably does not excel at this metric either

Most paragraphs read fine, but the story itself is fairly incoherent.

What human-like abilities would a scaled up version of GPT-3 have?

"Would it be worthwhile, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100x to achieve human-like performance in some domains? Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels. Nevertheless, I think we can expect further scaling." [1]

[1] https://www.gwern.net/newsletter/2020/05#gpt-3

Solving Winograd schemas would be a pretty interesting and significant step forward.

I released a large number of GPT-3 demos yesterday: https://github.com/minimaxir/gpt-3-experiments

Well, I take the opposite stand: GPT-3 = Give a 3rd grader Wikipedia and some paper. While it's certainly fun how relatively coherent it is in writing text, I have yet to see how it links syntactically correct text to actual facts. Which in my opinion is the difference between 1e100 apes with typewriters and the aforementioned 3rd grader.

I think gwern's point (whose opinion on deep learning I respect a lot more than his on genetics) is that GPT-3 is not nearly the end of the story. OpenAI released a GPT pretty much every year and each one is a spectacular improvement on the previous one with no sign of that trend ever stopping. If size is really what matters at all, there's no telling what GPT-4 or GPT-5 might be capable of, let alone a giant GPT run by state-sized actors.

The machine that trained GPT-3 is one of the largest GPU clusters in the world at the moment (10k V100 GPUs hooked up with fast interconnect).

For comparison, Nvidia's "Selene" supercomputer only has 2k A100 GPUs and is the #7 fastest supercomputer in the world: https://www.hpcwire.com/2020/06/22/nvidia-nabs-7-spot-on-top... A100 GPUs are around 3x faster than V100 GPUs.

It would take some time for governments to catch up :)
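The comparison above boils down to one line of arithmetic. Here is a rough sketch that uses only the figures quoted in this comment (10k V100s, 2k A100s, "around 3x" per-GPU speedup); the variable names are illustrative, and it ignores interconnect, memory, and everything else that matters for real training throughput.

```python
# Rough cluster comparison in "V100-equivalents", using the quoted figures.
V100_COUNT_GPT3_CLUSTER = 10_000   # GPUs in the cluster said to have trained GPT-3
A100_COUNT_SELENE = 2_000          # GPUs quoted for Nvidia's Selene
A100_SPEEDUP_OVER_V100 = 3.0       # "around 3x faster", taken at face value

# Express Selene in V100-equivalents, then compare the two clusters.
selene_in_v100_equivalents = A100_COUNT_SELENE * A100_SPEEDUP_OVER_V100
ratio = V100_COUNT_GPT3_CLUSTER / selene_in_v100_equivalents

print(f"Selene is roughly {selene_in_v100_equivalents:.0f} V100-equivalents")
print(f"The GPT-3 cluster is roughly {ratio:.2f}x that")
```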

I think the point Gwern makes is that if a government wanted to they could easily allocate enough resources to do this. More bluntly I think he's saying if the US decided tomorrow to begin a Manhattan Project for AGI there is a non-zero chance that they might succeed in 7 years.

If they started a Manhattan Project for AGI, would we know by now? Or only after it explodes?

Only after; secrecy and surprise are the whole point.

1. This is a list of publicly known supercomputers (it’s likely that the intelligence community have larger secret assets).

2. Most of the largest supercomputers are owned by states.

Lots of large companies don't include their clusters on those supercomputer lists. Including Google. IDK about DoD, but I wouldn't be surprised if they have top 10 clusters we don't know about. The list of largest supercomputers is definitely only a list of the largest disclosed supercomputers.

That said, only so many actors could've secretly sourced 10K+ V100s with fast interconnect...

> GPT-3 = Give a 3rd grader wikipedia and some paper.

I don't know if you're being hyperbolic here, but if not, I consider that a massive step forward for AI.

It's not a 3rd grader.

This kind of weird characterisation that people keep bringing up is totally wrong. It's more like a savant, who can generate random stories that have no bearing on reality.

It's just a bit better than GPT-2, which spat out mostly incoherent crap.

What's interesting about this is that the approach still seems to scale, and at some point, it might make something that actually generates useful output... and the ability of the model to handle general NLP tasks is a bit better.

So yeah, it's interesting, but no, it's not a massive step forward for AI, in the way having an actual 3rd grader would be.

I don't think a third grader would ever be able to write anything like this, e.g. https://read-the-samples.netlify.app/sample_1986. Like it or not, this dreamed-up story is internally coherent in a truly impressive way, and even more impressive is how it stays on-message throughout.

Grammarly says that text was plagiarized and if that's true, it's not a surprise that it's coherent.

And plagiarizing/wrongly paraphrasing texts is something every 3rd grader should be able to do. And yeah, generally a 3rd grader should be able to stay on topic as well...

And additionally he can draw, do image recognition, run circles, climb trees, pick fruits and mine rare minerals. Seems like we already have the business proposition of most AI businesses!

> Please share this article to show Trump’s new racist ad to everyone, because it is disgusting!

Ouch, we've all just been burned by an AI!

I'm more scared than burned. If I saw this text in the wild I wouldn't be able to tell that it wasn't written by a human.

For me the giveaway is that the level of hyperbolic content doesn’t match the level of textual content. It reads like subtle spam to me.

"It's more like a savant, who can generate random stories that have no bearing on reality."

That's actually just a particular special case of how this model is used, where they use the model to predict "the rest" of a story. It is not the nature of the model itself, and it can be used for other things, such as compression or other NLP tasks.

It's also interesting for the fact that it has only been trained on text. There is a big unanswered question as to what would be possible if this were adapted so that it could be cross trained on other data that would provide a basis for its text output.

How would you propose to train it on multiple kinds of inputs/targets?

I think fock's point is not about the 3rd grader's capability of reasoning, but only about their proficiency in writing syntactically correct sentences. GPT-3 does zero reasoning.

What are you calling reasoning? Is reasoning something entirely separate from choosing what word is expected next in a sentence based on the logic of the words? You can give GPT-3 things like "Alice was friends with Bob. Alice went to visit her friend _____" and "10 + 10 = 20. 20 + 20 = ___" and get right answers. You can tell it the definition of a made-up word and ask it to use the word in a sentence, and it can come up with a sentence that uses it in a relevant context. You can give it a short story and ask it questions about the story. All of these go far beyond basic syntactic concerns like putting subjects and verbs in the right spots.

> Is reasoning something entirely separate from choosing what word is expected next in a sentence based on the logic of the words?

Yes, otherwise a markov chain would be reasoning, too.

> "10 + 10 = 20. 20 + 20 = ___"

You can do the same in Prolog and similar languages. Is the language/the compiler reasoning?

> All of these go far beyond basic syntactic concerns like putting subjects and verbs in the right spots.

I'm not sure about that. The output obviously is a wonderful achievement, and will find a multitude of applications, but isn't it technically still a stochastic model describing the probability of a sequence of events, albeit at unprecedented scale?

Reasoning needs conscious thoughts, sentience, awareness, and we don't have the slightest hints these are present in GPT. Yes, humans reason about something and then write a coherent text, but that doesn't mean that the presence of a coherent text is proof of reasoning - just as the absence of the capability to produce coherent text isn't proof of the absence of reasoning (e.g. in the horrible cases of the locked-in syndrome https://en.wikipedia.org/wiki/Locked-in_syndrome )
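For readers who haven't met one, the Markov chain mentioned above can be sketched in a few lines. This is a toy first-order (bigram) model over a made-up corpus, not a description of how GPT-3 works: it predicts the next word purely from counts of what followed the current word, with no wider context. The function names and the corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


# Tiny made-up corpus, just to have something to count.
corpus = "alice visited bob . alice visited bob . alice called carol ."
model = train_bigram_model(corpus)

print(most_likely_next(model, "visited"))
print(most_likely_next(model, "alice"))
print(most_likely_next(model, "zebra"))
```

The model "completes" a sentence about Alice visiting Bob only because that pair occurred in its counts, which is the point the comparison is making.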

>> You can do the same in Prolog and similar languages. Is the language/the compiler reasoning?

Yes, absolutely. The Prolog interpreter is a resolution-based theorem prover. You will be very hard pressed to find any AI researcher arguing strongly that "that's not reasoning". In fact I believe automated theorem proving is one of the very few cases where you can find so little disagreement about whether something is "reasoning" or not.

Note also that "reasoning" is not necessary to solve a problem like "10 + 10 = 20. 20 + 20 = ___" (1). Knowledge of addition is sufficient: given knowledge of addition the result of "20 + 20" can be derived without reference to "10 + 10 = 20". And, absent knowledge of addition, even if a system can answer (1), it will not be able to answer the majority of similar problems, indicating that it has only memorised the answer to (1).

A better test of what a system has learned is a question like "10 + 10 = 30. 20 + 20 = ___" (2). The answer to that should still be "40", but again that's not because of any reasoning; it's because "20 + 20" is always "40", even when preceded by a false statement. So this kind of question is really not any way to test reasoning abilities.

Edit: Actually, "Alice was friends with Bob. Alice went to visit her friend ___" (3) is not a very good test for reasoning, either. If I were to answer "Alice", would you be able to say whether that's true or false? The only way to make such a decision is in the context of a "closed world assumption" (incidentally, central to Prolog's theorem proving). However, now you're making a much more precise claim, that "GPT-3 has learned to answer questions by making a closed-world assumption". You can test this claim much more convincingly by asking questions like "Alice is friends with Bob. Is Alice friends with Alice?" The answer should be "no" (or "false", "incorrect", etc).

Has this kind of more formal test been carried out, with GPT-x?
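To make the closed-world assumption concrete, here is a minimal sketch in Python rather than Prolog. The fact set and the `holds` helper are hypothetical names invented for illustration: any statement that is not among the known facts is treated as false, which is why "Is Alice friends with Alice?" gets a "no".

```python
# Known facts, stored as (relation, subject, object) tuples.
# Under a closed-world assumption this set is taken to be complete.
FACTS = {("friends", "alice", "bob")}


def holds(relation, subject, obj):
    """Closed-world query: anything not listed as a fact is treated as false."""
    return (relation, subject, obj) in FACTS


print(holds("friends", "alice", "bob"))    # the stated fact
print(holds("friends", "alice", "alice"))  # not stated, so treated as false
```

Note that this sketch does not assume friendship is symmetric; `holds("friends", "bob", "alice")` would also be false unless that fact or a rule for it were added.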

> You can do the same in Prolog and similar languages. Is the language/the compiler reasoning?

Yes, of course. Reason isn't magic.

Prolog is literally ‘logic programming’. It’s right in the name.

No, the compiler is following rules humans implemented. The humans were reasoning. The compiler follows a well-defined process. This is also what a scientific calculator can do - but the calculator isn't an example of AI, either.

Many linguists think that every human language follows fundamental patterns (e.g. [0]). In that context, the achievement of GPT is that it indirectly derived such a model by working through ungodly amounts of data. The results sound meaningful for us - but that doesn't imply that GPT intended meaning.

Every theory of reason I know has consciousness as a hard requirement. I'm not trying to be pedantic, but the topic of this thread is exactly the kind where clear definitions of words are important.

If Prolog is reasoning, then a scientific calculator is, too. But now we just need another word for the thing that differentiates us from calculators.

[0] https://en.wikipedia.org/wiki/Principles_and_parameters

What sort of definition of reasoning implies or requires consciousness? I haven’t seen one.

Those from Locke, Hume, Kant, Habermas, among others.

Is that then really a definition that carves reality at the joints? What is it that the AI will actually be hindered in doing, that is described by its inability to meet this definition?

If I think about a process using a certain mechanism, and the AI thinks about a process using a similar mechanism, but also I have a consciousness attached on top and the AI does not, then it seems petty to assign these processes different labels based on a component whose mechanical relevance is not shown. I'm not doubting the impact of conscious, reflective reasoning on human capability, mind! But most of the thinking I do is not that.

Also as a general rule, you should be skeptical of considerations of reason that are based largely on introspection; the process is inherently biased towards consciousness as a load-bearing element, since consciousness is so heavily involved in the examination.

These are very good points! Current theories of reason are obviously assuming human minds. Still, even if one wants to create a new definition that includes AGIs, there has to be some concept of agency, of wanting to achieve something, with the capability being the means to that end. The capability alone isn't what brings us closer to AGI.

Well, in the same way, a neural network follows rules humans implemented. With a little bit of mathematical optimization to actually describe a problem!

I think he meant GPT-3 does zero human reasoning.

Do we know that humans do reasoning? People talk much much faster than they could work out anything like what people consider to be logical reasoning.

Remember that logic had to be invented by Aristotle. It was a mechanical system meant to approximate how humans make decisions naturally.

Well, of course. It’s not human. Even a superintelligent AGI would likely do zero ”human reasoning” unless it wanted to do authentic emulation of a meatbrain for whatever reason. Like keeping uploaded humans running.

If the AI is at a 3rd grader level, that means it's already reached human intelligence!

In my opinion, the AI performs well above a third grade level in some areas (being able to write a complex paragraph, terminal completion, writing code) and below it in others (not knowing basic multiplication, not understanding spatial context).

A confused idea people have about AI: that the goal is to make a computer that thinks like a human. There's not much point in that: we have 7 billion humans, and making more is cheap and fun. At least for now, AI is interesting when it complements, not substitutes for, human intelligence. We want Google to be able to search a billion web pages. We don't care if it can't make a decent cup of tea.

You are missing a huge part of the picture.

If we can make an AI with an intelligence of a human, we can train that to be very good at some problem (like a human expert in some area). And then we can clone those intelligences into thousands in order to progress in that area much quicker.

A human-level intelligence is also a very important goal because it, even by definition, produces the singularity point. (AIs that can make better AIs, exponentially.)

> We don't care if it can't make a decent cup of tea.

We absolutely do, although it depends on the price at which they can do that. But once an AI can do that at all, it is only a matter of time before we can make them do it cheaply (we are very good at making things cheap once we know what things we need), at which point it becomes a driving force in our progress. Once there are self-driving cars, there will be a revolution in the labor force, for example, and driving is only a small portion of what a "human-level" intelligence is capable of.

Yes, I think that in the end, this would be true. AGI itself would be transformational - partly because it would be easily linked with computers' "superhuman" powers of memory, throughput etc.

But, on the way to AGI - which seems to me a long way off, despite some of the interesting arguments made here - it's not really true that "being very humanlike" is a big commercial advantage. As long as we have (say) less than the intelligence of a 2 year old child, or a chimp, then we'd rather have computers be doing very un-human things.

"Human-level intelligence/performance" is a term often used in ML tasks to denote top-level performance by a very capable human at the specific task being discussed, as a baseline to compare against. Perhaps not the world's best human (though sometimes, as with AlphaStar), but at least someone competent at the task (for example in https://github.com/syhw/wer_are_we). It is just a term used to gauge and compare how well a network operates.

You could of course find cases where this term makes other sense (or does not at all), since English is a flexible language, but I think that in areas where we obviously discuss AI/ML, let's just use the de-facto term and make everyone's lives easier.

> making more is cheap and fun

A common quip, but deeply inaccurate. Making productive agents who will advance your community is not cheap (neither in money nor in effort), and while some parts are fun, many parts are extremely not-fun. Productivity is clearly part of the goal of AI.

If the best AI you get is one that thinks like a 4-month-old fetus, making those is cheap and fun.

>Give a 3rd grader wikipedia and some paper

If that's your perspective, that would be considerably staggering progress.

There is an implicit goalpost for AI that laypersons have that assumes if it can't kill us all then it's not "real AI."

But if they kill us by accident, it's not AI, e.g. Tesla Autopilot.

Right...which is consistent cause the "AI" doesn't have volition.

Does a third grader not have intelligence?

And in a few years it might be a fourth grader. Then maybe a few years after that, a fifth grader. Slow progress is still progress.

and at some point the energy input is larger than producing the fifth grader and a generation of teachers. Which ... - seems inefficient, especially given that you can't teach this machine any new rules in a closed-model fashion; say, language finally adapts a nice grammatic form for non-binary-and-cis persons alike - how would you put it in this model?

When I read the output of this model I'm really quite impressed. However, given the sheer size of it and huge training corpus, to what extent is it just regurgitating source text?

I guess we could ask the same question about ourselves. How much of what we say is just a regurgitation of what we hear/read every day?

This is a common question and the authors answer this in their paper by checking the training set when evaluating it on particular tasks. E.g. for the addition task they search for existing samples of addition in the training data and find that not all are represented.
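The kind of overlap check described above can be sketched roughly as follows. This is only an illustration, not the authors' actual pipeline: the function name `addition_overlap`, the two surface forms searched for, and the toy corpus are all assumptions made for the example.

```python
import re

def addition_overlap(problems, corpus):
    """Return the (a, b) problems whose literal text appears in the corpus."""
    found = []
    for a, b in problems:
        # Two assumed surface forms: "a + b =" and "a plus b".
        patterns = [
            rf"(?<!\d){a}\s*\+\s*{b}\s*=",
            rf"(?<!\d){a}\s+plus\s+{b}(?!\d)",
        ]
        if any(re.search(p, corpus) for p in patterns):
            found.append((a, b))
    return found

# Hypothetical toy data, standing in for a real training corpus and test set.
corpus = "Homework: 123 + 456 = 579. Also, 7 plus 8 is fifteen."
problems = [(123, 456), (7, 8), (321, 654)]

seen = addition_overlap(problems, corpus)
overlap_rate = len(seen) / len(problems)
print(seen, overlap_rate)
```

A low overlap rate on a real corpus would suggest the model is not simply looking up memorized sums, though a literal string search like this cannot rule out paraphrased or reformatted duplicates.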

"What should we think about the experts? Projections of failure were made by eminent, respectable, serious people. They spoke in considered tones of why AI hype was excessive and might trigger an “AI winter”, and the fundamental flaws of fashionable approaches and why brute force could not work. These statements were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of few issuing a mea culpa or reflecting on it. It is a puzzling failure, and I’ve reflected on it before."

Who are these experts? Where are records of these routine statements? Seriously, I am an AI researcher, who said this?

Here's a good example from 2018:

"Deep learning (does not) scale... We can't just scale up AlexNet and get respectively better results - we have to fiddle with specific architectures, and effectively additional compute does not buy much without order of magnitude more data samples, which are in practice only available in simulated game environments."


But Gary Marcus is the gold standard. Check out his 2018 take-down attempt, which is clearly mostly wrong:


These predictions often surely make most sense and are most prevalent in a particular field of NLP research, like dialogue (open-domain, non-task oriented). Believable dialogue seems to require some kind of world knowledge, certain forms of 'common sense' (e.g. arithmetic reasoning), and the ability to track belief states and be coherent over long periods.

GPT-2 was critiqued for its inability to deal with simple arithmetic questions, as well as contradicting itself over long periods [1].

GPT-3 has similar deficiencies in terms of sensible real-world knowledge and displaying coherence both with its own previous answers and with a more general real-world knowledge:

'Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”.' [2] These are deficiencies of the type that would obviously torpedo attempts at open-domain dialogue over long periods.

Take a typical paper pointing out the failings of brute force methods for dialogue. They trot out a classic example:

'Without getting into the specifics of how these systems are constructed, it is apparent that such an approach [brute force] provides a method for imitating a dialogue but not for participating in one. We can see the limitations of the approach from examples like these: Human: what is your job ? Machine: i ’m a lawyer . Human: what do you do ? Machine: i ’m a doctor (Vinyals & Le, 2015).' [3] Huge models can provide plausible, but not consistently coherent, dialogue inputs. In the dialogue domain, the 'winter' will come (/has come) when it becomes clear that Meena, BlenderBot etc need a little help when it comes to coherence over an arbitrary number of turns, displaying 'common sense physics', and so on.

[1] https://thegradient.pub/gpt2-and-the-nature-of-intelligence/

[2] https://arxiv.org/pdf/2005.14165.pdf

[3] https://arxiv.org/pdf/1812.01144.pdf

Yann LeCun has warned of hype causing an AI winter:

Spectrum: Hype is bad, sure, but why do you say it’s “dangerous”?

LeCun: It sets expectations for funding agencies, the public, potential customers, start-ups and investors, such that they believe that we are on the cusp of building systems that are as powerful as the brain, when in fact we are very far from that. This could easily lead to another “winter cycle.”


To clarify, that's not to the credit of the article. The author is basically taking the piss out of people like LeCun, sarcastically describing them as "eminent, respectable, serious people" who "spoke in considered tones" (as if that's a bad thing) and wondering why they haven't issued a "mea culpa". At least, I find it a bit conceited to expect the people who built up a field of research from nothing to apologise for being worried that the field might be in danger from overhyping by a flood of newcomers who don't understand it.

Gary Marcus?

I mean, he and one or two other AI-research adjacent people basically have a hobby of bringing up limitations of Deep Learning etc, sure. But if that's all that is meant by this post, this is a weak point indeed...

After IBM Watson won at Jeopardy!, Noam Chomsky said: "Watson understands nothing. It's a bigger steamroller". I wonder if he still holds this view after reading GPT-3 samples.

“but an idiot savant, we should remember, is only a genetic mutation or bit of brain damage away from a normal human.”

I’m not sure this comparison holds... it seems like a chain of fuzzy implications taken as necessary fact.

I retired a year ago from managing a deep learning team. While I am a fan of the technology, I really yearn for more research aimed at hybrid AI systems. Even given one-shot (few shot?) learning, transfer learning, etc., I keep coming back to watching my grandchildren back when they were infants. They could see a picture of a new animal in a picture book, and really "get" why the animal looked different from others, etc. Deep learning will not get us there.

I think we get to general AI by using prior knowledge, exploiting deep learning, and also building systems that can develop their own models of the world that they can maintain, modify, discard, and combine.

As always, a great write up by gwern!

Winograd schemas falling at 10T parameters is interesting. That's probably only 5 years off.

If we can build something capable of passing Winograd schema, then it can probably write working non-trivial computer programs from plain text.

Google's PEGASUS summarization model[1] has learned to count up to five (which is amazing!!). That's "only" 568M parameters. It'd be interesting to see GPT-3 fine-tuned against the PEGASUS objective function.

[1] https://ai.googleblog.com/2020/06/pegasus-state-of-art-model...

For those curious the "counting" is at the end of the article, and really is quite impressive:

>Following this post is an example article from the XSum dataset along with the model-generated abstractive summary. The model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

>As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

It's one of the most amazing and surprising things I've seen in the last 12 months in machine learning (I follow and work in the field).

It surprises me a lot more than the excellent performance of GPT-3 on text generation for example. GPT-3 is amazing but looking at GPT-1 -> GPT-2 -> GPT-3 it isn't surprising. Counting on the other hand is something I wouldn't have expected from a summarizer.

Isn't the entire point of OpenAI's claim that GPT-3 is a few-shot learner that it generalizes concepts, not just syntax?

Yes, and that's very impressive.

But to me that isn't as surprising. Not claiming I would have thought of it, but if you have a very large multi-dimensional space (such as GPT-3) then giving it some examples of something pushes it into that general area of the space.

Generalizing concepts isn't a new thing - one could argue that word2vec from 2013 did that pretty well. GPT-3's "concepts" are vastly more complex than the single word (or maybe 2 word) concepts in word2vec though.

I mean in that sense, GPT probably just extracts low-n counts as separate concepts.

I'd love to see an architecture that can keep a separate short-term memory to allow it to count with multiple digits and follow algorithms. On the other hand, given what we've seen from GPT, at that point I would actually worry about it becoming a general intelligence...

> low-n counts as separate concepts

But how would that work?

I agree it probably doesn't "understand" math, but it has learned that number words can substitute for each other in a sentence (three ships/four ships/five ships) which isn't surprising.

But it has somehow learned to link that word with the correct length of the sequence of names, which is astonishing. I can't think of obvious "cheats" that make this work.

The best I can think of is that it has learned to count commas when they are separated by words.
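To make the comma-counting idea concrete, here is a minimal sketch of such a surface-level "cheat". It is purely hypothetical: nothing here says this is what the model does internally, and the function name `count_list_items` is made up for the example. It treats each comma or standalone "and" as a separator and returns separators plus one.

```python
import re

def count_list_items(text):
    """Estimate list length as (number of separators) + 1.

    Separators are ", and", a bare comma, or a standalone "and".
    """
    separators = re.findall(r",\s*and\s+|,|\s+and\s+", text)
    return len(separators) + 1

# The four frigates named in the PEGASUS example quoted above.
ships = "HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall"
print(count_list_items(ships))
```

A heuristic like this needs no notion of number beyond mapping a separator count to a number word, which would be consistent with it working only for small counts.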

Sounds like my children. For a long time my now-four year old counted like this: “one, two, three, so many!”

Sounds like your four year old was ready to start making inductive proofs!

I'm failing to connect the anecdotes with the conclusion. He's claiming that ML is scaling well but then gives the data on how GPT-3 is expensively brute-forcing its way to "success".

To me it just seems like what supercomputing is to normal computing: It makes the computationally expensive stuff do-able in a reasonable amount of time, or gives diminishing returns on existing algorithms. But it doesn't magic in any real advancements.

The problem in AI/ML and the concept of "AI winter" to me was always the barrier of the fact that we're just doing predictions, with no deep meaning or comprehension of features. The layman thinks there's magic, but there's not, so when that truth is revealed there will be problems. There's nothing intelligent about our Artificial Intelligence; we're just doing statistics with big data and added sprinkles. OpenAI just proved they could do statistics with even bigger data and more expensive sprinkles.

Has their work really shown we can get past that core problem? Personally, I don't see it.

>we're just doing statistics with big data and added sprinkles.

I mean, you could pretty much say that's how the human brain works, couldn't you?

Well, we don’t really know. I think that’s one element missing from most DL analysis: our understanding of the brain is incredibly limited. And ANNs are incredibly crude imitations of actual NNs. So we’ve built massive, extremely crude approximations of a system we don’t really understand.

As is often the case, the truth is somewhere in the middle. It’s almost certain that we won’t reach AGI without a fundamental breakthrough on soft/wet-ware but it’s also nearly certain that even with the best algorithms we will need to efficiently harness and coordinate massive compute power, as we’re learning to do now.

But that's the point, our brain is clearly a lot more than just a big table of probabilities. You only need to look at the absolutely insane volume of data and training time that these models need. How much time does it take for a human to understand a concept like "love" and what volume of training data is required? Computers would just regurgitate quotes from poetry or novels about love without any real understanding after billions of hours of training time and ingesting every document on the internet. A human can understand love in a tiny fraction of the time and with a tiny fraction of the volume of information processed and they also understand it in a more fundamental way and can articulate that in a way these models cannot.

You might argue back, well the human brain has pre-trained neural networks with billions of hours of training time. Well, that isn't really the case. We don't start off with some pre-existing memory of what "love" means, or what "Physics" is, or trillions of bytes of data. All we have is a capacity to learn which is highly efficient, a conscious mind which is aware of itself, and certain fundamental drives driven by our bodies and instincts. If you have a human child and give it zero input information it will never learn a language or be capable at all in any sense of the term. So we become incredibly capable based on a tiny fraction of input data fed into us after birth.

The way the human brain and mind works is deeply tied in to the experience of having a body, knowing we are mortal, and having fundamental drives such as a drive to survive, eat, drink, keep ourselves safe, and also a drive to be social, find a mate, and procreate. I would argue that we will never be able to have a computer/algorithm that thinks like we do unless it also has drives like we do and a body like we do, since so much of our process of thinking is tied in to having a body, our awareness of mortality, and our basic human drives and experience.

X = A+B

A= C+D

B = E+F

Love = X = A+B = C+D+E+F

Obviously the above is contrived and abstracted, but you get my point. If I took a little bit of time, I could schematically map every word that makes up the definition of love, and how they interact. Then I can associate 3D, real world graphical observations to each of those words and then love as a concept holistically (as humans do, we're not just confined to text data, we observe extremely rich 3D visual data, and audio data, and touch data, etc...). There's no reason to believe a massive "correlation machine" can't do the same thing with the right algorithms, enough compute power, and multimodal inputs. Furthermore, we can make the correlation machine even better by specializing parts of the hardware for certain tasks, just like the brain.

And it still wouldn't be the same. Again, our notion of love is tied into having consciousness, which machines would not have. We still don't even understand what consciousness is, how to define it, or how it is generated in the brain. While machines are not conscious, they could never "understand" or "experience" what we call "love" because love again is tied up with our experience of consciousness, the idea of mortality, and having a physical presence in the world.

What is consciousness, and what unique non-tautological properties does it give us?

I know of the general idea of consciousness, but I can’t boil it down to first principles. Self-awareness, on the other hand, is more tangible. AI would seem capable of internal cognition, reflection on past experiences, etc...They might not have the desire or need for such reflection, but they would certainly have the ability.

That's possibly true at the 'implementation' layer, but human learning is quite different, or at least has some aspects that are vastly different. Human manuals don't contain millions of tagged examples, they contain explanations and a few dozen examples.

So sure, when we learn how to walk or see or read, we may be in this mode of simply discerning patterns in large amounts of data. But when we learn maths or programming, we are using a completely different kind of learning.

That comparison is like comparing gardening my backyard to agricultural farming. Technically the core concept is similar, but I think we both know that the scale and depth makes them not at all the same.

There's a continuity of small changes between your backyard and industrial farming. Just like there's a continuity of small changes between you and your cat.

There’s a continuity of small changes between any two things.



For instance, there is no continuity of small changes in height that will bring you from climbing up incrementally larger trees to climbing to the moon.

The question is whether GPT-3 vs human intelligence is more like climbing a tree vs climbing a mountain or more like climbing a tree vs building a rocket.

I'm becoming increasingly convinced that Artificial Intelligence will fail not because we can't do the "artificial" part but because there is so little we do that is "intelligent".

> the fact that we're just doing predictions, with no deep meaning or comprehension of features

What - specifically - do you mean by this?

Something seems to be wrong with the footnotes on this page...

Edit: Nevermind, my browser seems to just be screwing up on Gwern's footnotes in general at the moment.

On my desktop, I hover over and they pop up a little window with the footnote in it. That looks like the intended behaviour but it must be failing on your platform.

"GPT-3 is an extraordinarily expensive model by the standards of machine learning: it is estimated that training it may require the annual cost of more machine learning researchers than you can count on one hand (~$5m), up to $30 of hard drive space to store the model (500–800GB), and multiple pennies of electricity per 100 pages of output (0.4 kWH). Researchers are concerned about the prospects for scaling: can ML afford to run projects which cost more than 0.1 milli-Manhattan-Projects? Would it be worthwhile, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100x to achieve human-like performance in some domains? Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels. Nevertheless, I think we can expect further scaling."

Love the document format and how hovering on links presents enough detail inline (example Bitter Lessons, Sutton). I appreciate not being taken to another page/tab. Wonder if a CSS style sheet is easily available.

Had never heard of MuZero before. It's impressive that it can reach AlphaZero levels in Go without knowing the rules.

Just look at the page source for CSS and JS used to create it.

Given that Image-GPT[1] which was based on the GPT-2 architecture has been shown to do very good image completion, I think in the next 12 months we'll see a unified Image/Language GPT.

I think it will be able to (imperfectly) do things like the following:

- OCR from images

- Textual descriptions of images

It may start to make some progress towards things like:

- Generating images from a textual description

- Producing structured documents (e.g. HTML) from document images

It'd be interesting to see how far along they already are with this.

[1] https://openai.com/blog/image-gpt/

I'm happy to see it being released.

What is the minimum hardware required to run this locally?

Or what's the cheapest way to run this model (even at barely acceptable performance) at a cloud provider?

Probably out of reach, but you can run GPT-2 on Google Colab: https://colab.research.google.com/github/ilopezfr/gpt-2/blob...

Any server with 512GB of RAM. So basically a couple of bucks per hour, to generate a few words per minute.

You don't need fancy GPUs with lots of RAM?

If you’re ok with generating a few words per minute, no.
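For what it's worth, the 512GB figure is roughly what a back-of-the-envelope calculation gives. The sketch below assumes the 175 billion parameter count reported in the GPT-3 paper (not stated in this thread) and counts only the weights, ignoring activations and any runtime overhead. The helper name `weights_size_gb` is made up for the example.

```python
def weights_size_gb(n_params, bytes_per_param):
    """Size of the raw weights in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumption: parameter count as reported in the GPT-3 paper.
GPT3_PARAMS = 175e9

fp16_gb = weights_size_gb(GPT3_PARAMS, 2)  # 2 bytes per parameter (half precision)
fp32_gb = weights_size_gb(GPT3_PARAMS, 4)  # 4 bytes per parameter (single precision)

print(f"fp16: {fp16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB")
print("fp16 weights fit in 512 GB of RAM:", fp16_gb <= 512)
```

Under these assumptions the half-precision weights fit in 512GB with room to spare, while single precision would not, which also lines up with the 500–800GB storage range quoted from the article elsewhere in the thread.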

All this advancement in text generation - but if you want to use it to represent sentences or documents in a semantic space you still have to use horrifically bad techniques like average or max pooling.

When is someone going to advance pooling techniques? We desperately need improvement!
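For anyone unfamiliar with the terms, here is a minimal sketch of average and max pooling over per-token vectors, in plain Python with no libraries. The token vectors are made-up toy numbers, not output from any real model, and the function names are just for illustration.

```python
def mean_pool(token_vectors):
    """Average each dimension across all tokens."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the maximum of each dimension across all tokens."""
    return [max(dim) for dim in zip(*token_vectors)]

# Hypothetical 3-dimensional embeddings for a 3-token sentence.
tokens = [
    [1.0, 0.0, -2.0],
    [3.0, 4.0, 0.0],
    [2.0, -1.0, 2.0],
]

print(mean_pool(tokens))
print(max_pool(tokens))

# Both pooled vectors are identical if the tokens are shuffled,
# so word order is thrown away entirely.
shuffled = list(reversed(tokens))
print(mean_pool(shuffled) == mean_pool(tokens), max_pool(shuffled) == max_pool(tokens))
```

The last line shows one reason these techniques are considered crude: any reordering of the tokens yields the same sentence vector.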

Incidentally, I've also been working on a selection of GPT-3 generated creative writing (primarily, but far from limited to, poetry): https://www.gwern.net/GPT-3

Did Gwern write parts of this with GPT-3? It has a certain... flavor.

Reminds me of Explorers, where the alien talks in soundbites from Earth TV. All hail our future plagiarist AI overlords!

What book should I read to be able to understand this text? No papers please. I don't understand what scaling a model means.

This is about new research, which mostly lives in papers and articles (like this) about the papers. It won't show up in introductory books for a while, so if you're unwilling to read or even look at papers, you won't be able to understand details of new research.

Scaling a model is just like it sounds: more data fed into a bigger network with more parameters. The gist of what this article is saying about scaling is that there's no sign of diminishing returns yet in terms of what the network can do and how well it generalises as the number of parameters is increased: the "more parameters = better performance" trend continues up to the enormous size of the full GPT-3 model, with no indication that even bigger models won't have even better performance.

Here is the GPT-3 paper: https://arxiv.org/pdf/2005.14165.pdf

If you really want to understand, skim this, and focus especially on the graphs, as they show the scaling. The x axis is usually model size, and the y axis is mostly accuracy or "loss" (~error).
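If it helps to see what "reading the scaling off a graph" means, here is a rough sketch. It assumes, purely for illustration, that loss falls as a power law in model size (loss = a * size^(-alpha)), which shows up as a straight line on log-log axes. The data points below are synthetic, generated from made-up constants, and are not numbers from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data: generated from loss = 10 * size^(-0.1), NOT real measurements.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * s ** -0.1 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")

# Extrapolate the fitted trend to a model 100x larger than the biggest one.
predicted = a * (sizes[-1] * 100) ** -alpha
print(f"predicted loss at 100x size: {predicted:.3f}")
```

"No sign of diminishing returns" then just means the real points keep sitting on the fitted line as models get bigger, rather than flattening out above it.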

The GUID Partition Table is already at its third version?

No, Generative Pre-Trained Transformer.

I know, it's just a stupid acronym because it's already used for something else (the partition table).

Oh boy, do I have bad news for you: Wikipedia lists ten overloaded acronyms starting with "AA" alone!

But they are usually in unrelated fields.

There's a lot of mixed metaphors here, but just to attack one I immediately recognize: the 1940 Scientific American article "Don't worry, it can't happen" is not (as Gwern appears to be implying?) stating that nuclear explosions are impossible. Instead, the article explains why nuclear chain reactions eventually stop, and do not continue to explode all available matter. S.A. is not saying Trinity Test would not explode, merely that it would not turn Earth into a new sun, and you can get some sleep about it.

Gwern's typographical conventions are idiosyncratic, and you might not have noticed that he linked the original Scientific American article in his text. Reading the article now, I don't see how your interpretation is defensible.

It doesn't mention exploding the earth, and while there is a little ambiguity, as Gwern does imply, they are describing a recent paper that concludes that large fission reactions are simply impossible: "They found instead, that instead of building up to a grand climax, it runs down and stops like an unwound clock."

The final line of the caption is 'Readers made insomnious by "newspaper talk" of terrific atomic weapons held in reserve by dictators may now get sleep'. At least superficially, that sure sounds to be more about atomic weapons being impossible than about whether the chain reaction would consume the entire earth.

I think you are confusing the actual linked article with Edward Teller's later argument that a nuclear fission explosion might ignite the atmosphere: https://www.realclearscience.com/blog/2019/09/12/the_fear_th....
