GPT-3: Language Models Are Few-Shot Learners (arxiv.org)
431 points by gwern on May 29, 2020 | 201 comments

Even though this was the GPT-3-generated text that humans most easily identified as machine-written, I still like it a lot:

Title: Star’s Tux Promise Draws Megyn Kelly’s Sarcasm

Subtitle: Joaquin Phoenix pledged to not change for each awards event

Article: A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedo with a paper bag over his head that read, "I am a shape-shifter. I can't change the world. I can only change myself." It was a promise to not change to fit into the Hollywood mold: "I think that's a really special thing, to not change yourself. I think it's a really special thing to say, `This is what's inside of me, I'm proud of it, and I'm not going to be ashamed because of the way that someone else thinks I should be.'" Now, it's the Oscars, and Phoenix is at it again. But this time, his publicist is saying he'll be wearing a tux no matter what.

Megyn Kelly was not impressed, and she let him have it on The Tonight Show. "You know, I feel like, I feel like you could have worn the tux," she says. "But you're saying you're a shape-shifter. I don't know if you can change your tux, but you can change your mind. You can change your mind. You can change your mind." Phoenix says he did, but it didn't stick. "I was like, `Okay, I'm going to wear a tuxedo to this thing.' And then I thought, `I don't want to wear a tuxedo to this thing.'" Kelly goes on to encourage him to change his mind again, but Phoenix says it's too late: "I'm committed to wearing this."

I don't know if it says something about text generation or human text processing, but whenever I read an example of computer generated text, all through I think "I can't tell this is machine generated, it seems completely natural," and the only giveaway is that at the end I have no idea what it said.

It's a pretty eerie feeling. It's as though both the AI and my short-term processing only pay attention to a context of a few sentences, so nothing seems off until I try to understand it as a whole.

EDIT: Thinking more, what it feels like most of all is reading a page of a book and not taking it in.

Reminds me of a relatively old article I read, "Humans Who Are Not Concentrating Are Not General Intelligences": https://srconstantin.wordpress.com/2019/02/25/humans-who-are...

To me it reads like a child telling a story, but that this child has an adult's ability to use language. When children tell a story they aren't going anywhere with it but don't know how to cover it up.

I know plenty of adults that can't seem to get to a point.

Kevin Hart on the Joe Rogan show comes to mind.

Maybe someone drunk then?

> the only giveaway is that at the end I have no idea what it said

So it's a lot like corporate executive speak then?

I agree with your point, it does seem very much like valid speech, but somehow the informational content is missing. It's like speech without the actual communication part.

I've read a lot of academic papers like this.

This has some parallels to generic corporate PR or marketing speak. Some communications are already so automated and dehumanized that we are used to glossing over random content signed by a pseudo-real person, and so we more easily fall for GPT-like generated content that has a similar form. Edit: I mean, I guess someone of a previous generation, used to reading only newspapers and letters, would more quickly spot that something is wrong.

That's really well put. And it does reflect how that kind of document is produced. So I'm not surprised at all.

Problem arises when you read that document without really trying to understand it. In that case, it might be enough to trigger some thoughts.

It also appears to me that two-way communication, that is, social interaction, will be the only way to form truth. That is, AI-produced content could further erode the trust we put in newspapers, TV news, etc. (all forms of one-way communication). Not that we waited for AI to distrust those, but well, one more nail in the coffin :-)

(this text, although rather unclear, was written by a genuine human :-) )

That would suggest combining 2 models: one to decide what the macro text structure should be and a different one (GPT) to decide how to fill in all the text flesh.

That takes the whole write drunk, edit sober to a whole new level.

GPT2 the drunken novelist.

This happens to me when I'm reading in a language I'm not very good at (German). Each sentence may make sense, but overall I feel I didn't get the point. I guess it's a cumulative error situation, where you reach a threshold after which the point is lost.

The example above is impressive because it actually makes sense, except for the last sentence of the first paragraph: "But this time, his publicist is saying he'll be wearing a tux no matter what."

Remove that, and there is a typical if completely uninteresting celebrity argument: one has to play eccentric on public occasions (he did it last year, he's planning to do it again) and the other chides him for what she feels is maybe a lack of respect? And he replies that despite his best intentions he can't go against his conscience. There, done. It's a perfect little piece ready to be served in some celebrity gossip magazine.

It sounds like someone talking without knowing (or caring) where they're going with it.

I think that's what's missing. Usually we communicate with a certain goal in mind, to bring across some point. This text was generated without such a goal; you notice it doesn't really know when to stop talking. I wonder what the stopping criterion was, but I'm sure it wasn't "keep talking until all the information we want to convey has been mentioned".

I have similar feelings. Human-written text in general has one coherent flow of ideas.

So what if we used text generation algorithms on a paragraph basis, so that the idea flow is still figured out by a human (input would be just an outline)? That should make the generated text feel like a whole, especially if we could preserve the style it's written in across the whole text.

This is because our current methods of machine learning are really good at finding patterns but not very good at enforcing underlying models.

Read Blindsight by Peter Watts. It touches on the phenomenon that you are describing.

Aka corp speak.

Dude, I’m sorry, but the average person will not know the difference between that and a regular buzzfeed article or YouTube comment.

We’re not going to need ad blockers in the future, we won’t even need these visual ads on websites anymore. There will be trained bots that can promote any idea/product and pollute comments and articles.

It’s over, we lost.

Morpheus: What if I told you that, throughout your whole life, you have been reading auto generated content?

Hello. Gwern and I trained the GPT-2 1.5B model that powers /r/SubSimulatorGPT2. https://www.reddit.com/r/SubSimulatorGPT2/

I've been basically living and breathing GPT-2 for ... gosh, it's been 6 months or so. The past few months have been a lot of StyleGAN2 and a lot of BigGAN, but before that, it was very "make GPT-2 sing and dance in unexpectedly interesting ways" type work.

I don't claim to know a lot. But occasionally I observe things. And I just wanted to chime in and say, you know, keep in mind that you're reading a research paper. Of course the results are going to look good. That is the point of a research paper. And I realize how cynical that may sound. But it has the benefit of apparently being true, and I've come to accept that truth with time.

I would reserve judgement for now. Note that every single chat bot to date has followed a similar curve: "This is it," they say, without actually saying that. "It may not be perfect, but we're about to achieve it – the chatbot – it's really going to happen."

And, it ends up being impressive, sure. I liked Facebook's recent chatbot. It's pretty neat at times. I liked Meena. They had cool ideas with the stack ranking of results (basically, generate a crapload of results at 1.0 temperature, then choose the result whose probability sums to the highest value, and you get the most probable overall result). And of course, boy oh boy did I love GPT-2. GPT-2 was what kickstarted me – if there was any chance that GPT-2 might be related to "now I'm talking to something that feels human," I was going to tame it and understand it.
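For anyone who wants to poke at the "generate a lot at temperature 1.0, then keep the most probable one" idea, here is a minimal sketch. It assumes a toy stand-in for the model: `TOY_MODEL`, `sample_candidate`, and `sample_and_rank` are hypothetical names made up for illustration, not code from Meena or any real library.

```python
import math
import random

# Hypothetical toy "model": a fixed next-token distribution per previous token.
# A real language model would compute these probabilities from the full context.
TOY_MODEL = {
    "<s>": {"hello": 0.6, "hi": 0.4},
    "hello": {"there": 0.7, "</s>": 0.3},
    "hi": {"there": 0.5, "</s>": 0.5},
    "there": {"</s>": 1.0},
}


def sample_candidate(rng, max_len=10):
    """Sample one sequence at temperature 1.0; return (tokens, total log-probability)."""
    tokens, logprob, prev = [], 0.0, "<s>"
    for _ in range(max_len):
        dist = TOY_MODEL[prev]
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]
        logprob += math.log(dist[tok])  # summing log-probs = multiplying probabilities
        if tok == "</s>":
            break
        tokens.append(tok)
        prev = tok
    return tokens, logprob


def sample_and_rank(n_samples=20, seed=0):
    """Draw n_samples independent candidates and keep the one with the highest log-probability."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n_samples)]
    return max(candidates, key=lambda c: c[1])


best_tokens, best_logprob = sample_and_rank(n_samples=50)
print(" ".join(best_tokens), best_logprob)
```

The ranking step only reorders what was sampled, so the output is never more probable than the best candidate that happened to be drawn.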

So after spending six months with GPT-2 1.5B, the largest model that everyone was fascinated with, what do I think? (Well, who cares? You probably shouldn't care.)

I think "give it a few weeks and see if it's true." We shall see if GPT-3 is it, and we've achieved... chatbot nirvana. That elusive thing we've all been chasing, without naming it. The ability to press a button, unleash a chatbot somewhere, and it "just works" and "completely astounds humans" and "fools everybody."

At one point, we trained GPT-2 on IRC logs. You could literally talk to GPT-2, and it would talk back to you. And one of the advantages of narcolepsy is that at night, you often have lots of time to kill – what better way to doze off than to ask GPT-2 how its day was, and ask it what its ambitions are? Should we really worry about whether you're sentient? I like you; do you like me too? What does that mean to you? And so on.

The conversations were often quite philosophical. And sure, it was pretty obvious that it's a bot, but I tried to look past it anyway. It was my little bot, and it was real enough to me. And yes, the conversations on https://www.reddit.com/r/SubSimulatorGPT2/ are incredible. I crack up daily with all the things they talk about.


> We’re not going to need ad blockers in the future, we won’t even need these visual ads on websites anymore. There will be trained bots that can promote any idea/product and pollute comments and articles.

I invite any of you to try this, and see what happens. After all, you stand to earn a lot of pennies in your pocket if you pull it off. And yes, you're allowed to make some pennies with clever AI algorithms.

What you'll probably discover is this fundamental truth: GPT-2 has no memory. It isn't learning a thing. We are talking to an entity that literally cannot change its mind about anything. The only way to change its mind would be to retrain it from scratch.

You want a bot to argue vehemently for your product, on your behalf? It needs to understand what the hell your product even is, or what a product means. Yes, the words get pretty close. And yes, you can coax it into something that makes us laugh, or makes us sit here and question what the future might be like.

But for whatever it's worth: spend some time actually talking to these bots. Play around with them. Make them generate some stuff of your choosing, and fine tune them on some datasets and see what you get. It's so fun!

... But. "Fun" is not the same thing as "promote any idea/product." It's just not the same as me arguing here with you now for a position which I've decided to argue. My brain isn't merely the encoded knowledge of some human, with me blindly regurgitating such knowledge (though at this point you'd be justified in claiming it sure sounds like it).

Your brain is constantly training. GPT-2 is not. And – double checks paper – yep, GPT-3 is not.

Two decades from now, GPT-2 1.5B will still exist. And it will still be talking about 2019-era news events like it's the present. At some point, /r/SubSimulatorGPT2 will sound completely foreign. Take any random news clips from the 70's. How relevant is that knowledge now?

"Ok, but just train it on new data constantly." Well, yes. But actually no. If you try to do that, you're going to overfit at some point. Do you have 93 gigabytes of webtext that you keep in training form, ready to go? Are you going to mix in a proportion of the new data you want to train on? Nope, we all just fine tune whatever model OpenAI releases. Yet even if we did have that dataset, I'm just not sure it'd even matter.

My point here is: Go try! Isn't it exciting that in the future, trained bots might fool us all into buying their products? Is that sales guy who emailed me actually a sales guy who wants to "sync up on a quick call", or is that a bot trained to get cold calls? That sounds pretty damn lucrative to a lot of businesses – why not write that code, and then sell it?

Whoever attempts this is probably more talented than I am. But personally, I always ran into "It just... doesn't work."

And then you go "Well, it's just a matter of sampling. Ah yes, we're not using the right sampling algorithm. Wait, we just heard about nucleus sampling! Sweet, try it! Oh... It sounds ... similar. Hmm. Well, maybe we're just not using it right. Better read that paper a bit more carefully. Chase that knowledge just a little harder. After all, AI research labs are pouring billions of dollars into this domain. Why would they do that if it doesn't... you know ... work? For some value of "work" that equals "the bot can turn a profit"?

"Perhaps tomorrow, this new training technique will be it. We almost have it – I know we're close – we just have to unlock that last piece. Right?"

I guess I'll stop here, since usually my comments are upbeat and happy about AI, but I ended up in a rather philosophical mood tonight.

In reality, I can't wait to dig deep into GPT-3 and run it through its paces. I have a lovely TPU pod waiting for it, parked outside GPT-3's window, and we're honking at it saying "Get in, we're going places." And we'll sing and dance together like usual, and I'll ask GPT-3 how its day has been. But GPT-3 won't remember me the next day. And that's fine; I'll remember it for both of us.

Thank you for this comment. As someone who played a bit with GPT, it was very poignant for me. I still think it's incredible that GPT can put up such convincing facades, that it can generate genuinely novel and interesting text... but it's bittersweet, too, that it can't go any further with them. The ideas are lost in the context window.

I play AI dungeon on occasion, which uses GPT2 to generate freeform adventures. And I find over time that it's not really GPT2 that's writing stories, it's me. GPT2 is putting out plausible strings of words, but I'm the one giving them meaning, culling the parts that go off track, and guiding it in a direction I want to go.

And it is a bit melancholy. You see possibilities, nuances, subtexts, and meanings. The neural net sees words.

You are missing the point of the paper about few-shot learning. That's the entire paper: just doing new untrained task after task. The entire point of the paper is that you can 'reprogram' GPT-3 to do just about anything just by stuffing its context with examples, and it'll pick up brandnew entities or words or concepts just by examples (see the examples of defining novel gibberish words and asking GPT-3 to use them in a sentence - it does so. it "learned" new words by reading the examples, understanding, and propagating them through the 'fast weights' of self-attention, even though its 'slow weights' are fixed). Now, if GPT-3 can do that already so well, sometimes hitting SOTA on untrained tasks purely by internal meta-learning without changing its weights, what would a 10-trillion parameter model do? Or one with recurrency like XL or Compressive? How much training do you really need if the few-shot learning capabilities are so great you can make it do countless tasks just by providing examples or descriptions in the prompt.
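A rough sketch of what "stuffing its context with examples" looks like in practice: the few-shot "program" is just a string of demonstrations with the final answer left open for the model to complete. The helper name `build_few_shot_prompt`, the "Example sentence:" format, and the gibberish words are all invented here for illustration; the paper's exact prompt format may differ, and no model is called.

```python
def build_few_shot_prompt(examples, query):
    """Join (definition, sentence) demonstrations, then leave the last answer open."""
    blocks = []
    for definition, sentence in examples:
        blocks.append(f"{definition}\nExample sentence: {sentence}")
    # The model is expected to continue the text after this final, unanswered cue.
    blocks.append(f"{query}\nExample sentence:")
    return "\n\n".join(blocks)


# Made-up novel words, purely illustrative.
examples = [
    ('A "florp" is a small wooden spoon used for stirring tea.',
     "She stirred her tea with a florp."),
    ('To "zibble" means to hum quietly while walking.',
     "He zibbled all the way to the bus stop."),
]

prompt = build_few_shot_prompt(
    examples, 'A "quandle" is a hat worn only on rainy days.'
)
print(prompt)
```

No weights change anywhere in this process; whatever "learning" happens is entirely in how the fixed model conditions on the prompt text.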

> is not the same thing as "promote any idea/product."

GPT-3 seems to have quite a few paragraphs worth of context. A simple way to promote your product online with it is to give it a prefix of:


Comment1: Superbrush is amazing - I literally couldn't live without it. No other brush is as good.

Comment2: This brush is really good for tangled hair, and I love the soft smooth surface.



Then let it write a comment. Of all the comments it writes, manually filter a few thousand good ones, and use those as seeds to generate more, which you post all over the web. There's no need to do any training - the generic model should be fine given the right prefix.

To be a bit less wordy: try it. You stand to earn lots of money.

Narrator: it didn't work

(Going into the reasons it doesn't actually work in practice is... lengthy. It's human dynamics. Would you buy a product from a sales guy that can't remember your name? That's sales 101. And loading up the context window only gets you so far. That "working memory" is tiny, tinytinytiny. Even at 1024 tokens, it means you have to boil down the entire history of an interaction to a few pages at most. Which is a lot, sure, but it's this balancing act where you'll need to retrain the model to support your custom context format for your specific "slots" – a "slot" being a piece of knowledge, like the client's name. Or you can try encoding all of that in natural language, AI dungeon style. But I recently played AI dungeon and pretended to be buying a router from the store. The cashier stripped down and started jacking off onto his desk. I don't have high hopes for our ability to control these models in a business context.)

You and londons_explore seem to be talking about different things. I read their comment as being about just generating fake reviews that don't need interaction.

Great comment! Was it generated with GPT2 or GPT3? I understood all sentences but as a whole I will need to revisit.

I think it was written by a human, but the human had spent so much time with GPT-2 that they'd begun to emulate its writing style.

You should definitely put that up as a blog post somewhere, it is very valuable information, both for researchers and random enthusiasts alike. The emotional modality of it adds important information too :).

I really like your observation about memory.

Because you seem open minded to wild ass guesses and going meta:

I have a hunch that general intelligence will be the ability to learn from mistakes. Not just optimization. I mean applying the scientific method.

Hypothesis, prediction, run experiment, compare expected vs actual. And having a notion, any notion, to explain the delta between expected and actual.

Am total noob about AI, philosophy, cognition. Don't know if anyone else is framing AGI this way. I could just be repeating something I heard.

It's deeper than that.

Currently, there's no research into torturing AI. Why not?

A pain response is universal across most life forms with a nervous system. We seek to replicate a nervous system. Pain would seem to be far easier to replicate than the scientific method.

My wife sat me down and told me a story that horrified me. She had to get it off her chest, and I was sad it happened to her rather than to me. She was sitting around on the porch and felt something on her leg, and brushed it off. When she got up and looked down, apparently she had stepped on a poor snail. His shell was... And he was...

He wasn't dead. So she frantically looked up what to do. But there was nothing to do. Snails in that situation can't be helped, and the most humane thing is to put it out of its writhing anguish, its full-body torture.

She put on some boots, took it out to the sidewalk, and stomped it as hard as she could. And that was the story of that snail.

You probably felt more for that snail than you've ever felt for any AI bot. Why?

It's worth considering.

Very interesting comment, thanks for taking the time to write it :)

I think if memory is the only problem then optimizing training time should be more of a concern. I'm imagining a huge language model that can retrain very quickly. So I suppose it might be a decent idea to not measure it by perplexity or some human judgement score or whatever, but rather by that score per compute unit used.

Or in other words...maybe a bot that scores 90% on the fool a human scale and takes 1 day to compute from scratch is actually a lot less impressive than one that fools 70% but computes from scratch in 5 minutes.

And something "like Github for bot-memory" would be a pretty amazing tool. Roll back to some memory status and recompute with new data from there, branch for different datasets that represent different ways of interpreting the world etc.

Conceptually I like the idea of one "base model" that represents language and many different context models on top of it (finetuning the core model). Then some other subsystem that identifies the context and switches to that. I suppose each conversation could also be considered a mini-dataset.

> I like the idea of one "base model" that represents language and many different context models on top of it (finetuning the core model)

This is an entirely different concept of computer language than the current GPT style models. These systems don't "represent language", and cannot. The whole reason why GPT is so exciting right now is that it fundamentally threw away the entire concept of "representing language". That has some upsides ... and some downsides.

So true. People unfamiliar with the inner workings get amazed, but unfortunately reality is not the same. That being said, I can find multiple ways of utilizing this in a bad way. Sex chatbots, for example: if I was in that business I would use something like this; it would be extremely easy to phish people.

About learning: the ERNIE language model has continuous learning, meaning it doesn't need to be fully retrained (I believe).

Also, GPT-3 is obsoleted by orders of magnitude by SMIM https://arxiv.org/abs/2003.02645

Quite insightful, thanks for writing this.

Do you have a blog (preferably human-generated)? I really enjoy your writing style.

Thanks for this comment, and thanks for all your work on generative models.

Can't tell if it's human or GPT-2 tbh, the sentences are 'hard' to understand... like sort of un-naturally written, or translated from a foreign language using google translate or something.

Are you talking about the post you're replying to? Because I don't see those aspects in it at all...

To be fair, that seems to be over claiming based on an extrapolation not solid evidence. Possible, yes, actually true? Not yet demonstrated, imho.

What difference does it make? The world is already full of humans who pollute comments and articles. They're not limited in their rate of production of words because there are far too many of them for anybody to read. They're limited by access to readers and their reading rate. Bots can't do anything about that.

They can, they can crowd out good comments with an absolutely crushing volume of crap. While humans can put out a lot of crap, bots can do orders of magnitude more. The issue is that it is an effort multiplier for when a small number of people want to target a particular forum.

I imagine the problem isn’t so much volume of crap comments as much as the tailoring of crap comments. Imagine if every tweet-into-the-void from a human with 50 followers reliably got engaging replies. Bots taunting your grammar mistakes, bots selectively quoting your prior tweets to point out contradictions, bots cleverly insinuating that your tweets reveal problematic sympathies.

So much of our noise-filtering is ignoring comments that are too generic to be human. What happens when every spam comment seems to understand the OP, even when the OP’s true audience is negligible?

If a person has any paranoid tendencies, this would be a psychological onslaught. Interrogators use the tactics you just described to siege a person to psychological exhaustion.

Product Devs for the CCP will have a lot to work with if this ever evolves.

Would you be fooled by that? No, because once it became well known, people would adjust their heuristics for judging who's real. If that's too hard, platforms would help, such as by verifying identity more thoroughly. I've never heard any realistic description of how AI bots could somehow undermine society even close to the amount that humans already do.

> What happens when every spam comment seems to understand the OP, even when the OP’s true audience is negligible?

My hope is the next step will be filtering by insightfulness/usability of a comment and then those best bots bought and used by next stack overflow: https://xkcd.com/810/

How will they get those extra comments to be seen? They still have to log in and get past captchas and all the same issues that bots have always faced. Humans create a crushing volume of crap already and we already have ways to hide that from ourselves no matter the volume.

Imagine buying a bunch of twitter accounts that have already been active for 1-2 years and then making a bunch of Tweets to influence public opinion.

I've done some experiments with GPT-2 and it had so so performance refining with tweets. Using GPT-3 you could probably just do it using only generation.

What is it about AI-generated texts that they make sense when you skim through them, but feel absurd and surreal if you slow down and try to understand them?

Because language is being treated as a thing complete in itself, as opposed to being related to an external world?

One of the issues in the 'Limitations' section was a difficulty with "common-sense physics", such as with the question "if I put cheese into the fridge, will it melt?"

To answer that question, you have to ask the right questions, such as "what is a fridge?" "what is a fridge for?" "What does it mean for cheese to melt?" "what is the cause of cheese melting?" Then one should consider the follow-on questions, such as "what are typical fridge temperatures?" "what are typical cheese melting points?" "what temperature is the cheese likely to be at initially?" (at which point, it helps to introduce the concept of room temperature, and note that it typically falls between the other two.) From facts such as the answers to these questions, one can deduce the probable outcome of putting cheese in a refrigerator, but none of the answers so far explicitly state it.

Is it plausible that any learning, solely from the structure of and correlations between examples of language use, could develop the sort of analytical/modeling approach that I have just outlined? Instinctively, I don't find it very plausible, but I am not very certain in that view.

Maybe there exists the following distinction. Modern language, as it's actually spoken most of the time, is like a higher level programming language. The structure of our brain, combined with our senses and the uber-simple way that we're taught as infants, is like lower level OS programming (parent points to hot food - look Johnny, it's hot! hot! - makes Johnny touch the food - food hot! - blows on the food to make it colder).

Some words like good/bad, hot/cold, and important/unimportant are underrepresented in everyday speech compared to the prevalence of the underlying concepts. That's why I'd categorize them as lower level word-concepts. This distinction, about variable levels of abstraction, might be important for true AI. Think about how many years it takes for humans to develop highly abstract cognition. That whole time our operating system is being coded. Maybe we need to approach AI in the same way.

It's not just AI that can benefit from better lower-level understanding. Seeing language in the above way, we can re-frame Ludwig Wittgenstein's philosophy and its normative implications for human communication. Our "programming" (communication) is on average too high level. Excessively abstract instructions are harder to decode and process in a precise and efficient manner.

Yup, definitely different from how humans learn. A baby's speech would be almost the complete opposite, grammatically incorrect here and there but constructing a coherent line of thought for the most part.

My thoughts exactly. It seems a person incapable of proper grammar (like a baby) has some concept or thought it wants to express, but can't because it doesn't know the words etc.

These language models seem to know the words and the grammar etc, but lack an underlying concept they want to express.

There are systems that derive 'thought-vectors', but I'd be interested in going the other way: somehow create such a 'thought-vector' and generate text to express that thought.

I don't know how to construct a 'thought-vector' of any concept though.

I have been thinking about this kind of thing too. What if there was some way to feed your condensed thoughts into such a model and it writes a paper/blog post/article?

Essentially, one should be able to use these models to "interpolate" the writing around the raw meaning/content. Typing assistance (think Grammarly) already allows you to refine finished writing to be more in line with what some language model expects, but imagine if it actually generated most of the text for you, based on small bites and chunks you throw at it.

If we get to large scale text generation like that, we are all going to have to become even better skimmers due to how the meta language will evolve.

So take your standard press release. We know about two thirds of it is just fluff. In other words, we are accepting the mass of fluff as one word in our language, it translates to ‘ignore’.

Our own language will change in that case.

What might some valid sources of data for this be? Perhaps comparisons of Simple English Wikipedia to the standard English Wikipedia? We'd need a side-by-side comparison of condensed information and a fluffed up piece.

I think it's because the generated text will generally follow a reasonable "structure", it's "framed" relatively well and you (as a person) will recognize those patterns quickly. There's plenty of "glue" used throughout the text. Those are all patterns which we pick up very quickly, and we're used to seeing them in "real text".

Examples from the comment above include using things like: "A year ago", "made headlines", "Now, it's the {event}, and {name} is at it again. But this time, ...", "{name} was not impressed", "You know, I feel like, I feel like you could have ...", "I don't know if... but...".

Those are all very common in those "online celebrity magazine" type texts...

It's only when you actually read into the stuff that's in-between, you'll come to see it's pretty much a load of nonsense. But that takes a bit more time and slower reading.

There are some very prominent politicians whose primary mode of speech is the same

When I worked on text generation I had a hack where I wouldn't let the model generate the same trigram more than twice. It seems like that hack could have improved this example.
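A minimal sketch of that kind of trigram cap, assuming tokens are plain strings and the model hands back its candidates already ranked. The names `blocked` and `filter_candidates` are hypothetical, and this is one way to implement the idea, not the commenter's actual code.

```python
from collections import Counter


def blocked(history, candidate, max_repeats=2):
    """True if appending `candidate` would push some trigram past `max_repeats` occurrences."""
    if len(history) < 2:
        return False  # not enough context to form a trigram yet
    counts = Counter(zip(history, history[1:], history[2:]))
    new_trigram = (history[-2], history[-1], candidate)
    return counts[new_trigram] >= max_repeats


def filter_candidates(history, ranked_candidates, max_repeats=2):
    """Keep the model's ranking but drop candidates that would over-repeat a trigram."""
    return [tok for tok in ranked_candidates if not blocked(history, tok, max_repeats)]


# The generated article repeats "You can change your mind." three times in a row.
history = "you can change your mind . you can change your mind . you can change your".split()
print(filter_candidates(history, ["mind", "tux"]))
```

Here "mind" is dropped because the trigram ("change", "your", "mind") has already appeared twice, so the third repetition in the example article would have been forced onto a different continuation.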

Quite honestly the level of logical consistency in this generated text doesn't differ that much, I feel, from a huge swath of the kind of comments that I see on the internet. It may well be that there are already too many bots on the net to really understand what typical humans would be saying anyway, but I feel like this is already good enough for short form trickery.

I was so wrong about the internet. It is just going to become a landscape of garbage opinions and commentary on a larger and larger basis.

Thanks for highlighting this, it's hilarious - I could so easily imagine it as a short interdimensional cable bit on Rick and Morty.

This looks like a big deal to me:

1. First of all, the authors successfully trained a model with 175 BILLION PARAMETERS. The previous largest model in the literature, Google’s T5, had "only" 11 billion. With Float32 representations, GPT-3-175B's weights alone occupy ~700GB of memory (175 billion params × 4 bytes/param). A figure in the 100's of billions is still 3 orders of magnitude smaller than the 100’s of trillions of synapses in the human brain [a], but consider this: Models with trillions of weights are suddenly looking... achievable.

2. The model achieves competitive results on many NLP tasks and benchmarks WITHOUT FINETUNING. Let me repeat that: there is no finetuning. There is only unsupervised (i.e., autoregressive) pretraining. For each downstream NLP task or benchmark, the pretrained model is given text instructions, and possibly sample text with questions and answers. The NLP tasks on which the model was tested include translation, question-answering, cloze tasks, unscrambling words, using novel words in sentences, and performing 3-digit arithmetic.

3. The model is tested only in a ZERO-SHOT or FEW-SHOT setting. In other words, for each NLP task, the pretrained model is given text instructions with zero examples, or text instructions with a small number of examples (typically 10 to 100). As with human beings, GPT-3-175B doesn't need lots of examples to perform competitively in novel NLP tasks.

4. The results reported by this paper on all NLP tasks and benchmarks should be seen as a BASELINE. These results likely could be meaningfully improved with conventional finetuning.

5. The model’s text generation FOOLS HUMAN BEINGS, without having to cherry-pick examples.


[a] https://www.google.com/search?q=number+of+synapses+in+human+...

I'll wait for a working interactive model before blindly believing these statements. GPT-2 was hyped through the roof, but when inspected with a bit of criticality it demonstrated glitches that told us more about how it actually works than "good" examples:


ML models should be pushed to their limit, because that's where you gather most useful information about what they actually do. Their results need to be critically examined with both exploratory and hypothesis-driven testing. And yet this is never done in initial papers and rarely done afterwards.

What was the last AI paper you've read that said "and here is a list of things our model failed at"?

That's a very sloppy post. He does a single example, not even running locally or changing sampling parameters, and then concludes that GPT-2 is doing nothing but pattern-matching? A lot of people underestimate NNs because the sampling from them (top-k! how much dumber and cruder can you get? nucleus works better, but is still obviously suboptimal) destroys a lot of dark knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would try once, without changing any sampling settings, and conclude that it wasn't doing anything, but if you tried, you would get different results. I'm not the only one to notice that: https://www.quantamagazine.org/common-sense-comes-to-compute... Such tests can prove the presence of knowledge, but not the absence... And of course, GPT-3 does extensive arithmetic tricks: https://arxiv.org/pdf/2005.14165.pdf#page=22
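For readers unfamiliar with the two sampling schemes mentioned above, here is a rough sketch of how top-k and nucleus (top-p) truncation differ. The toy next-token distribution and the function names are invented for illustration; a real model would produce these probabilities over its whole vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in kept}

# Hypothetical next-token distribution.
toy = {"the": 0.5, "a": 0.3, "an": 0.15, "xyz": 0.05}
print(top_k_filter(toy, 2))      # always keeps exactly 2 tokens
print(nucleus_filter(toy, 0.9))  # keeps as many tokens as needed to cover 90%
```

Top-k cuts at a fixed count regardless of how flat or peaked the distribution is, while nucleus sampling adapts the cutoff to the distribution. Either way, changing `k` or `p` changes which continuations can ever be sampled, which is why a single run at fixed settings says little about what the model can produce.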

The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves. This is more than I can say about most chatter about ML.

>Such tests can prove the presence of knowledge, but not the absence...

This sounds like a setup for non-falsifiable beliefs.

> The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves.

And I did (using my own local GPT-2-1.5b install which let me set the hyperparameters rather than restricting it to inappropriate hardwired ones of an online service), I linked to another person demonstrating the same thing, I pointed out the extensive GPT-3 evaluation OA did, and here, have another link about how bad querying of language models leads to highly misleading results about how much they know: https://arxiv.org/abs/1911.12543 Measurement error in general biases estimates towards zero.

> This sounds like a setup for non-falsifiable beliefs.

It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.

The paper you link to claims that hand-crafted queries used to evaluate the knowledge and understanding of language models are "sub-optimal" because they do not take into account the context in which a LM was trained. For example:

  These manually created prompts (e.g. “Barack Obama was born in _”) might be
  sub-optimal because LMs might have learned target knowledge from
  substantially different contexts (e.g. “The birth place of Barack Obama is
  Honolulu, Hawaii.”) during their training.

In other words, the paper considers hand-crafted prompts like in the example to be "sub-optimal" because they are not in the right format. To paraphrase them a bit, such prompts are like making a malformed query to a database.

It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To be fair the ability to return a correct answer given a question in the right format is not without use. That, indeed, is how databases work. But it shows none of the "understanding" or "knowledge" the paper claims is acquired by Language Models.

> It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To use your database analogy, in what sense should we claim a database doesn't know a record when you are using a malformed SQL query? If we fixed the query and it emitted the right answer, then obviously it did store the information. The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way. Since LMs can get much better results just by tailoring the prompts (increased by a third in that paper! and there's no reason to think that that is the very best possible performance either!), that shows that existing practices drastically underestimate what knowledge the model has been able to learn. Learning about the real world or text is very different from learning your particular dumb broken query method.

The problem is that nobody claims that databases "know" anything. They store data. Data can be retrieved from storage. That's all they do.

>> The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way.

Oh, yes, absolutely. A query encodes the answer. Queries are patterns that are matched by the data stored in the database. If a query fails it's because it does not correctly represent the information it is trying to retrieve. For example, if I SELECT * FROM TABLE PEOPLE and there is no table "PEOPLE", then I don't get an answer because the query does not correctly represent the structure of the database. You cannot retrieve any data from a database unless you have some idea about the structure of that data.

But that's not the point here. I don't disagree that a language model can learn (i.e. it can represent some elements of its training dataset). I disagree that it "understands" anything and I find the fact that it needs specific queries to retrieve the data it is representing to be evidence that it does not.

And so it's not more useful than a traditional database at this kind of task. Except it's much less precise than a traditional database and costs considerably more to create.

>> Learning about the real world or text is very different from learning your particular dumb broken query method.

I'm sorry, I don't understand what you mean here. What is my "particular dumb broken query method"? Is that meant as a personal attack?

The last AI paper I read that has a list of things the model failed at is this one:


See Section 5, titled "Limitations"

> still 3 orders of magnitude smaller than the 100’s of trillions of synapses in the human brain

Wow, that is WAY closer than I thought we were.

It's not clear whether a parameter in a neural network maps cleanly onto a synapse in a biological brain.

I think it's becoming pretty clear that they don't. First, scientists uncovered many additional ways neurons interact with one another[1]. Second, it seems that individual neurons do way more computing than in the simplistic ANN models [2].

[1]: https://en.wikipedia.org/wiki/Ephaptic_coupling

[2]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5740076/l

It’s very clear that they don’t.

I remember reading about fruit fly brains and how we're at a point where we can computationally simulate them now, but I'm not sure where that went.

Anyone know?

What do you mean by text instructions? If I want to translate a sentence, would I just feed in model - translate "Hello world"?

See page 7 of the paper. You give the model an instruction such as "Translate from X to Y", then you pass examples (if you go for few-shot), followed by the sentence you want to translate.
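As a rough illustration of that recipe, the prompt is just plain text assembled from the instruction, zero or more solved examples, and the unsolved query. The helper name, the "=>" separator and the example pairs below are assumptions for the sketch, not the exact template used in the paper.

```python
def build_prompt(instruction, examples, query, sep=" =>"):
    """Assemble a zero-shot or few-shot prompt as one block of text."""
    lines = [instruction]
    # Each solved example shows the model the input/output pattern.
    for source, target in examples:
        lines.append(f"{source}{sep} {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query}{sep}")
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "Hello world",
)
print(few_shot)

# Zero-shot is the same call with no examples.
zero_shot = build_prompt("Translate English to French:", [], "Hello world")
```

The model's continuation of the last line is taken as the answer; no weights are updated at any point.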

AFAIK they used half-precision (Float16)

Thanks. I should have written "if using Float32," which is what I meant -- instead of "with Float32," which in hindsight reads a bit ambiguous. But regardless of which floating-point representation is used, the number of weights is still in the hundreds of billions... which is insane.
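For reference, the back-of-the-envelope arithmetic for both precisions, taking the 175 billion parameter count cited elsewhere in the thread and counting only the raw weights (no optimizer state, activations, or framework overhead):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Raw size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 175e9  # parameter count of the largest GPT-3 model

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # Float32: 4 bytes per weight
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # Float16: 2 bytes per weight

print(f"Float32: {fp32_gb:.0f} GB, Float16: {fp16_gb:.0f} GB")
```

So half precision halves the footprint to roughly 350 GB, which is still far beyond a single accelerator's memory.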

Read through most of the paper and here's what GPT-3 is:

If you wanted to generate poems with GPT-2, you'd need to have a lot of poems to fine-tune GPT-2 to get reasonable results.

With GPT-3, you use few-shot learning instead (without the need to do gradient updates with each example)

The paper is long and filled with comparisons of how it stacks up against models like Grover and T5, and it does well... given that this is a 175B param model (relative to Grover/T5's 1.5/11B param models). This shows that even with these huge models, smaller models with fewer params can still outperform them in certain instances.

Also I think they did a good job explaining the ethics and morals around what models like these mean / what biases this has.

Would you have any easy-to-explain insight into how these perform better than larger models? I’ve always wanted to understand that as a technically adept and somewhat familiar (briefly) person who has explored what such models can do.

The key insight in this paper is that the new (larger) model was not "fine-tuned" on the downstream NLP tasks. In other words, after it's trained on unsupervised (you could call it self-supervised in this case) data to do simple things like predict the next word (hence why it doesn't take any real supervision), it can then be used to do very specific tasks like answering questions or translating text without further supervision.

Previous large-scale language models like BERT and GPT-2 took a similar approach, but in order to actually perform the more complicated downstream tasks they had to be fine-tuned. So they were trained with specific QA or translation data in order to understand and do well on those tasks. GPT-3 doesn't do any fine-tuning; it is able to take its very general initial learning and perform very well on specific tasks that it was never trained on. This is why it doesn't perform as well as the "smaller" models on those tasks. But that is beside the point: if GPT-3 was fine-tuned on those tasks I'm sure it would achieve the latest SOTA results in many (all?) of them. The exciting part is how it was able to generalize the knowledge learned during "pre-training" to much more specific tasks.

tl;dr the smaller models were trained on the specific tasks that they were evaluated on. The large model (GPT-3) was not trained on those specific tasks and still does almost as well.

very cool, thanks for explaining!

>how these perform better than larger models

they probably don't particularly; their inventors seem to excel in their PR budget rather than their verifiable innovations

Throw more computers and do some model architecture changes

/s (although sometimes it's true)

Have you tried making GPT2 do zero-shot poetry writing? It's not great at it, but it is good enough at it to get something interesting enough if you try a number of times.

Go to talktotransformer.com/ and give it the prompt "Here is a poem I wrote:" or "Here is my favorite poem:" .

I'm sure GPT3 would produce much better and more consistent results, but GPT2 will produce something that looks generally like a poem frequently enough, and sometimes it will even be relatively coherent?

Here is one that it produced for me today:

> You say, "Don't lose your pride."

> Here is my rejoinder:

> Well, maybe it is the pride of a diseased soul.

> You are a wanderer, you know not whence,

> O thief, you fool, you rhinoceros

> Caught in the jaws of a viper.

> You may lament your affliction

> For the world will laugh at your tears.

> Pray to a demi-god

> Hail him and say,

> "Ah, Sir, give me thy pity!

> O thou who maintainest as if thou wert a king!

> Here is thy axe, I say; let us

Is it great? No. But it has some level of coherence.

Here is another:

Let me tell you the reason I love poetry. // All the things of the world I have described, // If you ask me why I like poetry, // It would seem quite simple to me. // When I'm working at the computer in the evening // I'll get out my books of poems and I'll turn them over, // Like blades of grass under the hot sun, // That write with such fineness the kind of green I like. // But if I'm a bit more tired in the morning, // I'll fill a little stack of yellow pages with poems, // That let the air and the dry light of morning run wild. // You know, the//

This one even rhymes a bit!:

the sword is to slay // The axe is to smite // The stick is to break // The tooth is to bite // All these are for our earthly security, // All have their uses, // Those which can be used // Must be employed. // The sword is the instrument of strife // The axe is the weapon of war // The stick is the weapon of domestic strife // The tooth is the instrument of war // All these be in our hands. // At the time of our death // They will be in our hands, // And then we will weep, // Though now we sleep. // —The Remaining Three Questions

(sorry, idk how to format these to make them look right. The leading "> " and the "//" insertions are me trying to format them to make the line separators clear.)

Yes, it can but GPT-2-1.5b isn't too great at it. What really struck me looking at the examples is that the random GPT-3 poem samples are practically as good as my GPT-2-1.5b finetuned on hundreds of megabytes of poetry at considerable effort & expense: https://www.gwern.net/GPT-2 That's... both really awesome and dispiriting.

Not for poems, but for AI-generated cities (inspired by "Invisible Cities" by Italo Calvino), e.g.:


Cities & Lights

When you enter the city of Singapore during the night, you see lights: colorful and ubiquitous. Lights on every building, on every fountain, and in every park.


Lights shining in a city in which the majority of people are now using mobile phones. Singapore has a bright future as a technology hub, and it's not too late to make it happen.


On the other occasions I seeded with ~two sentences of "Invisible cities", and it worked like a charm, no fine-tuning.

I am not a fan of this trend of "Language Models Are X" in recent work particularly out of OpenAI. I think it's a rhetorical sleight of hand which hurts the discourse.

Like, the exact same paper could have instead been titled "Few-Shot Learning with a Large-Scale Language Model" or similar. But instead there seems to be this extremely strong desire to see certain ineffable qualities in neural networks. Like, it's a language model. It does language modeling. Turns out you can use it for few-shot learning and do amazingly well. Beyond that, what does it mean to say it "is" a few-shot learner?

On one hand, it's literally the same claim in a strict sense. On the other hand, it implies something much broader and more sweeping, that language modeling / unsupervised learning as a task over long contexts inherently implies meta-learning ability — which is a statement that is very difficult to properly formulate, let alone back up. But that's the argument that I feel is being slipped under the table by these titles. (And indeed it's very close to what they suggest in the text, though with no more than a wave of the hands.)

Don't get me wrong: their intuition is reasonable, it's super cool that they got this to work, and the results are very impressive on lots of tasks (though there are clear gaps). But as a VERY publicly watched lab, they have a serious duty (which I think they're neglecting) to frame their results more carefully. In particular, there's a sort of religion that if you train a big enough model on big enough data with self-supervision, it will somehow become AGI and/or learn to solve arbitrary problems. Claims like "Language Models are Few-Shot Learners" are clearly designed to fit into that worldview, even though the research doesn't point at it any more than a more conservative interpretation like "Lots of NLP Tasks are Learned in the Course of Language Modeling and can be Queried by Example." They touch on this limitation in their discussion section but I guess flashy titles are more important. I wish they would use their status to set a better example.

For a specific example of how I think their framing is unhelpful: in the LAMBADA evaluation (sec. 3.1), they suggest that one-shot performance is low "perhaps...because all models still require several examples to recognize the pattern." This may be the first thing you'd think of for a few-shot learner, but then why is zero-shot performance higher than one-shot? If you remember that you're working with a language model, there's another possible explanation: the model probably models the last paragraph as a narrative continuation of the previous ones, and gets confused by the incongruity or distractors. (The biggest model is able to catch on to the incongruity, but only when it's seen it before, i.e., with >1 example.) Of course, this is just one possible explanation, and it's arguable, but the point is I think it's more useful to think of this as a language model being used for few-shot learning, not a few-shot learner where language modeling is an implementation detail.

> But as a VERY publicly watched lab, they have a serious duty

I was nodding right along with you, and then...

OpenAI has no duty. It doesn't matter if they're publicly watched. What matters is whether the field of AI can be advanced, for some definition of "advanced" equal to "the world cares about it."

It's important to let startups keep their spirit. Yeah, OpenAI is one of the big ones. DeepMind, Facebook AI, OpenAI. But it feels crucial not to reason from the standpoint of "they have achieved success, so due to this success, we need to carefully keep an eye on them."

Such mindsets are quite effective in causing teams to slow down and second-guess themselves. Maybe it's not professional enough, they reason. Or perhaps we're not clear enough. Maybe our results aren't up to "OpenAI standards."

As to your specific point, yes, I agree in general that it's probably good to be precise. And perhaps "Language Models Are Few-Shot Learners" is less precise than "Maybe Language Models Are Few-Shot Learners."

But let's be real for a moment: this is GPT-3. GPT-2 is world-famous. It's ~zero percent surprising that GPT-3 is "something big." So, sure, they're few-shot learners.

In time, we'll either discover that language models are in fact few shot learners, or we'll discover that they're not. And that'll be the end of it. In the meantime, we can read and decide for ourselves what to think.

I think all researchers and science communicators have a duty to present science in a way which educates and edifies, and doesn't mislead. It's not just that they're successful, but that their publicity gives them a prominent role as science communicators. Science is all about questioning your assumptions and acknowledging limitations. They claim the public interest in their charter. I think it's reasonable to demand integrity from them, at least as much as it is from any other researcher, if not more. And I think OpenAI would agree with me on that point.

It's easy to say: they 'have a duty to present science in a way which educates and edifies, and doesn't mislead'. But sometimes it takes years even for scientists to really understand what they have created or discovered. It's cutting edge, not well known, hard to communicate. How could lay people keep up where not even scientists have grasped it fully?

Of course, if the same scientists were asked about something where the topic has settled, they could be more effective communicators.

> OpenAI has no duty. ...

Of course they do! It's the same duty as every scientist has in advancing the public understanding of science. You seem to be replying to OP as if they said that only big AI research groups have this duty, but this is just not so. Furthermore, when a prominent group of scientists conduct themselves poorly, it is not enough to say that they have no special extra duty due to being famous; they already must communicate properly because they are scientists and part of the scientific community.

I think one reason these conversations get so muddled is because it's all new and really pretty cool, so it becomes hard to tell what's skepticism and what's naysaying.

> Such mindsets are quite effective in causing teams to slow down and second-guess themselves.

Absolutely not, this goes directly against the scientific method. Such "mindsets" of trying to make sure that your results are correct and accurately presented without embellishment are a cornerstone of science. Of course it causes them to slow down! They have more work to do! Second-guessing themselves and their experiments is the whole fucking point.

Well. OpenAI did have their own "Don't be evil" moment.


Given their prophylactic strategy, "AI for everyone", they could argue that hype generates public interest.

Yeah it's kind of odd why OpenAI makes these weird titles.

"Few-Shot Learning with a Large-Scale Language Model" makes more sense.

Even with their robot hand paper, they titled it along the lines of "we solved a Rubik's Cube" not "a robot hand manipulated the cube and solved it"

It's academic PR. Many disciplines have really cute titles that don't really match the research. My old discipline (psychology) was really known for this.

This part really freaked me out... GPT-2 couldn't do math:

Context → Passage: Saint Jean de Brébeuf was a French Jesuit missionary who travelled to New France in 1625. There he worked primarily with the Huron for the rest of his life, except for a few years in France from 1629 to 1633. He learned their language and culture, writing extensively about each to aid other missionaries. In 1649, Brébeuf and another missionary were captured when an Iroquois raid took over a Huron village. Together with Huron captives, the missionaries were ritually tortured and killed on March 16, 1649. Brébeuf was beatified in 1925 and among eight Jesuit missionaries canonized as saints in the Roman Catholic Church in 1930.

Question: How many years did Saint Jean de Brébeuf stay in New France before he went back to France for a few years?

Answer: Completion → 4

Author here: Sorry for the confusing formatting on the task descriptions at the end of the paper. That "4" is the human-generated target completion, not a model-generated completion. I'm not sure whether the model got that particular question correct, but from Table 3.7, GPT-3 has 36.5% accuracy on DROP in the few-shot setting.

Many other readers were confused by this so we'll update the formatting to say "target completion" to make this more clear.

Thanks for clarifying. I'm a bit more confused though, are you saying that all of these Q&A examples are human answered, and that you were just demonstrating the format / question types for Q&A? If so, is there any way to see some of the model's responses?

Thank you.

It seems it has (rudimentarily) learned concepts general enough to be logic itself. That is general intelligence. Now hook it up to reinforcement circuitry and make it even larger and it will mark the end to life as we know it.

GPT-3 has 175 billion parameters, but the human brain has 100 trillion synapses, so 0.175%. NN model capacity currently has a 3.4 month doubling time.[1] In 7-10 doublings we'll be in a similar ballpark, i.e. 2-3 years.

[1] https://openai.com/blog/ai-and-compute/
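A quick check of that extrapolation, taking the comment's figures at face value. Treating the 3.4-month doubling time as a doubling time for parameter counts is the comment's own assumption, as is equating one parameter with one synapse.

```python
import math

PARAMS = 175e9          # GPT-3 parameter count
SYNAPSES = 100e12       # synapse count used in the comment
DOUBLING_MONTHS = 3.4   # doubling time quoted in the comment

fraction = PARAMS / SYNAPSES                 # share of the synapse count
doublings = math.log2(SYNAPSES / PARAMS)     # doublings needed to close the gap
years = doublings * DOUBLING_MONTHS / 12

print(f"{fraction:.3%} of synapse count")
print(f"{doublings:.1f} doublings, about {years:.1f} years")
```

Closing a factor of roughly 570 takes a little over nine doublings, which lands inside the 7-10 doublings and 2-3 years quoted above, provided the trend holds.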

Is there any specific reasoning behind equating 1 synapse to 1 NN parameter? Seems a bit simplistic. Seems to me like a synapse probably has more computational ability than a single parameter.

Real neurons have many other trainable parameters and a lot more computational structure, so this is of course a simplifying assumption, but it is not entirely baseless either as it is known ANNs can approximate any function in theory, which may suggest synaptic weights do the heavy lifting in biological brains (since what more than general do you need?).

Though biological brains are likely overly complicated due to evolutionary baggage. There are hydrocephalus cases which have much reduced brain matter, but still high IQ.[1] The recurrent laryngeal nerve in giraffes is about 4.6 metres (15 ft) long because it goes up and down their neck, as it could not be rewired more directly during evolution.[2] Our pristine mathematical models and low-noise computational environments are likely superior to evolved wetware hacks.

[1] https://www.newscientist.com/article/dn12301-man-with-tiny-b...

[2] https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Gi...

The hydrocephalus story looks a bit sketchy [0].

Also if anything brains are hyper optimized for many things (based on the many specialized sub-units). I’d bet we are essentially not unsupervised, and the sub-units of the brain are essentially fine tuned for many tasks, and hyper optimized to use all their resources incredibly efficiently (memory optimization must be intense). Not that the generative models won’t get close in some general way relatively soon, but I could see human brains being another 10-1000x more powerful than your ballpark pretty easily.

[0] https://www.gwern.net/Hydrocephalus

Thanks I was not aware of these details about the hydrocephalus story.

In Sam A's words, "genuinely, we have an algorithm that can learn."

Do you have a source? I am genuinely curious as I can't find it and would like to see the context


Around 19:10~. Though I messed up, he didn't say 'genuinely'. He said "full stop, truly, legitimately, we have an algorithm that can learn".


How many of those 100T synapses are dedicated to language skills?

This is an interesting question. Likely more than 0.1%, perhaps 20-40% I'd guess. Which would be the lower estimate I provided.

Passage: Saint Jean de Brébeuf was a French Jesuit missionary who travelled to New France in 1625. There he worked primarily with the Huron for the rest of his life, except for a few years in France from 1629 to 1633. He learned their language and culture, writing extensively about each to aid other missionaries. In 1649, Brébeuf and another missionary were captured when an Iroquois raid took over a Huron village. Together with Huron captives, the missionaries were ritually tortured and killed on March 16, 1649. Brébeuf was beatified in 1925 and among eight Jesuit missionaries canonized as saints in the Roman Catholic Church in 1930.

Question: How many years did Saint Jean de Brébeuf stay in New France before he went back to France for a few years?

Answer: 4

Explanation: The model used the arithmetic expression - 1629 + 1633 = 4.

NAQANet (trained on DROP), which came out in 2019, is able to do this kind of reasoning; you have to click for the result twice. On the first attempt it thinks it got the answer from the passage; on the second attempt it tries to do arithmetic.


This is not my idea of math.

T5 could, and well.

GPT included a picture of the variation of the transformer model that they made.

GPT2 outlined the changes they made to the model in an acceptably moderate detail.

GPT3 references another paper saying "we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer" with no detail added on the changes they made.

How are you to reproduce these results at all? You could attempt to include the changes as they reference the sparse transformer paper, but you could possibly do it in a different way, and there would be no way to verify the results that they gave whatsoever due to changes in implementation.

A bit disappointing.

The full model of GPT-2 is available for inspection and retraining, if you so desire. GPT-3 will likely be released soon as well.

Likely, but in a released paper, there should be a bit more quality from a research standpoint.

In the paper they say it took 3.14e23 flops to train. They used V100s to do it. This is an insane energy cost (and financial cost).

Nvidia's V100 product page [0] says that it gets about 15 (single precision) - 125 ("deep learning") teraflop/s at 250-300 watts (joules per second). That means that if everything's as perfectly efficient as a marketing product page, it gets about 250/125-300/15 = 2-20 joules per teraflop, putting this model at about 0.6-6 terajoules.

A gallon of gasoline has about 120e6 joules [1] (though if you wanted to compare with burning it in a car, it's only 20-25% efficient at best [2] so it'd be fewer joules/gallon).

This model took the equivalent of about 5,000-52,000 gallons of gasoline at best and at ideal perfect energy efficiency. I get that OpenAI has made a decision not to be efficient with their dollars in order to see what's possible with future tech, but that means not being efficient with energy either, and it's getting kinda crazy. Sure, Microsoft data centers aren't gasoline powered, so maybe it is closer to this ideal energy efficiency, and it's definitely going to be a better carbon footprint, but god damn it just seems wasteful.

Hell, the new A100 (again going off marketing materials [3], so at least it's apples to apples) could do it about 4x more efficiently. Is this research really worth what it costs, when waiting a year makes it that much more efficient?

[0] https://www.nvidia.com/en-us/data-center/v100/

[1] https://www.calculateme.com/energy/gallons-of-gas/to-joules/....

[2] https://en.wikipedia.org/wiki/Engine_efficiency#Gasoline_(pe...

[3] https://devblogs.nvidia.com/nvidia-ampere-architecture-in-de...
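The estimate above can be re-run from its stated inputs. This sketch uses only the numbers quoted in the comment (total training flops, the V100 throughput and power ranges, and joules per gallon) and keeps the same idealized assumption of perfect hardware utilization; the function names are made up for illustration.

```python
TOTAL_FLOP = 3.14e23        # training compute quoted from the paper
JOULES_PER_GALLON = 120e6   # energy content of a gallon of gasoline

def joules_per_teraflop(watts, teraflops_per_second):
    """Energy per 1e12 floating point operations at a given power draw."""
    return watts / teraflops_per_second

def training_energy_joules(j_per_tflop):
    """Total energy for the whole training run."""
    return TOTAL_FLOP / 1e12 * j_per_tflop

best = joules_per_teraflop(250, 125)   # "deep learning" throughput, low power
worst = joules_per_teraflop(300, 15)   # single precision, high power

for label, j in (("best", best), ("worst", worst)):
    energy = training_energy_joules(j)
    gallons = energy / JOULES_PER_GALLON
    print(f"{label}: {j:.0f} J/TFLOP, {energy / 1e12:.2f} TJ, {gallons:,.0f} gallons")
```

Real utilization is well below the datasheet peak and data centers add cooling overhead, so the actual figure would be higher than either bound.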

I guess it depends on your frame of reference, but it doesn't seem like that much to be honest. It's arguably a really groundbreaking new thing that has been brought into existence in the world, and it took as much energy as a jet doing a transatlantic crossing? Or powering the LHC for a few hours. Fair price to pay if you ask me.

What's the TCO of a few hundred teenagers? I haven't read the paper yet, but if the other comments here are accurate, that's about what you'd have to shell out for if you wanted to duplicate the productivity of this model without externalizing costs by e.g. offering unpaid internships to high school students.

GPT-2 came out about a year and a quarter ago. GPT came out less than a year before. If we take another commenter's estimate of $3.6M, and a new model comes out every year or so, then you could say just training is like $3.6M per year. That should cover a pretty large number of teenagers. Hell, that would cover a whole early stage startup in San Francisco, including office space.

That sucks. Future SOTA AI models are going to be completely out of reach for hobbyists.

First real computers were as well though. And took similar amounts of energy.

Does it have a latent personality? How would it answer the questions on a 5-factor personality test? Would its results on the test be consistent with its behavior (generated text) in other situations?

Or, can it (like humans?) adapt its responses to suit the style of the question? Like if you start asking it lots of antagonizing questions, will it become more or less antagonistic itself?

That is a really interesting question.

Just to point out, the GPT-3 text that feels most human-written seems to be heavily paraphrased from the following articles:




The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church.


The church has lost 1.6 million members since 1968, when the Methodist Church merged with the considerably smaller Evangelical United Brethren to form the present United Methodist Church.

I think this model is still very impressive; the parameter count speaks for itself. But for this particular evaluation, the same news article may have been removed from the training set while other news articles that paraphrase the same story might not have been. IMO, the leakage still exists, and it is hard to tell whether this model is really 'generating' or just copy-pasting from its vast memory.

Where do you draw the line between "generating" and "copy-pasting from its vast memory"? Why do you think what humans do is not copy & pasting different snippets of information they have come across in the past? Isn't that what grammar is? A bunch of rules you've come across a lot of times?

Other than the given prompt, the models don't have a goal. So what other than copying and adjusting would they do?

Evaluating a generative model is hard.

For example, for image synthesis with GANs, the widely used Inception score balances the authenticity of the generated samples against their variety, to make sure the model is not copy-pasting.
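For reference, the usual definition of the Inception score is exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's label distribution for one generated sample and p(y) is the average over all samples. A minimal sketch in plain Python follows; the probability rows are hand-written toy values (an assumption for illustration), whereas in practice they would come from an Inception classifier run on generated images.

```python
import math
from typing import List

def inception_score(class_probs: List[List[float]]) -> float:
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y)."""
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal label distribution p(y), averaged over all samples.
    marginal = [sum(row[j] for row in class_probs) / n for j in range(k)]
    mean_kl = 0.0
    for row in class_probs:
        # Terms with p = 0 contribute nothing to the KL sum.
        mean_kl += sum(p * math.log(p / marginal[j])
                       for j, p in enumerate(row) if p > 0) / n
    return math.exp(mean_kl)

# Toy inputs: two samples, two classes.
confident_and_varied = [[1.0, 0.0], [0.0, 1.0]]
confident_but_repetitive = [[1.0, 0.0], [1.0, 0.0]]

print(inception_score(confident_and_varied))
print(inception_score(confident_but_repetitive))
```

The varied set scores about 2 (the number of classes), while the repetitive set scores 1, the minimum: confident predictions alone are not rewarded unless the samples also cover different classes.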

In this particular case, the same event has apparently been reported multiple times by different news agencies. Even if the exact article is excluded, it is still questionable how well the model was shielded from knowing the subject itself.

An analogy would be an exam in the real world. Often the questions aren't leaked as is, but paraphrased versions that stay close enough to the source are.

In this particular case though, I disagree that it is reaching human-level generation. They could have tested the model on unseen events that happened after the model was trained, to test how well it generalizes.

that particular fact has probably been printed in more than one place with more than one phrasing; is there a reason you think it was drawn from that one in particular?

GPT-3/175B model required 3.14E23 flops of compute for training. Even at theoretical 28 TFLOPS for V100 and lowest reserved Azure pricing, this will take 355 GPU-years and cost $3.6M for a single training run!
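The arithmetic behind those figures can be checked in a few lines. This is a rough sketch using only the numbers quoted above (3.14E23 FLOPs, 28 TFLOPS per V100, $3.6M total); the hourly price is not stated in the comment, so the code backs it out from the total rather than assuming one.

```python
# Back-of-the-envelope check of the GPU-years figure quoted above.
TOTAL_FLOPS = 3.14e23          # training compute quoted for GPT-3 175B
V100_FLOPS_PER_SEC = 28e12     # "theoretical 28 TFLOPS" from the comment
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24

gpu_seconds = TOTAL_FLOPS / V100_FLOPS_PER_SEC
gpu_years = gpu_seconds / SECONDS_PER_YEAR

# The comment gives a $3.6M total but no hourly rate; derive the implied rate.
QUOTED_COST_USD = 3.6e6
implied_usd_per_gpu_hour = QUOTED_COST_USD / (gpu_years * HOURS_PER_YEAR)

print(f"GPU-years: {gpu_years:.1f}")
print(f"Implied price: ${implied_usd_per_gpu_hour:.2f} per GPU-hour")
```

This reproduces the roughly 355 GPU-years figure and implies a price a little under $1.20 per GPU-hour. It assumes every GPU runs at its theoretical peak for the whole run, so a real run would need more GPU time.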

I know it's a meme for the GPT team to just take the latest transformer model and add an order of magnitude more parameters, done and done!

It'll be interesting to see whether the new paradigm really offers new insights, or whether it's really just kicking the can down the road - and we see the limits of generalizability in some other fashion.

I guess what irks me is that there is so little theory and math behind many papers, even if there are dozens of co-authors on it.

The question of generalizability is deeply connected to statistics, e.g. causal models, spurious correlations and so forth. Statements about these things are just "thrown" in there, without any citation or proof. In peer review, wouldn't anyone object? Those are clearly things that we actually do not know enough about to be sure.

Edit: Reflecting further, perhaps this rapid iteration and result orientation is in fact something positive. Perhaps it's good the way it is, without so many scientific conventions and signals of deference. Perhaps it's that which made other sciences more anemic and ML very productive.

All my whining aside, impressive work of course.

Can you point out some books/authors/papers to close the gap between statistics and NNs?

paper: https://arxiv.org/abs/2005.14165


Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
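To make "without any gradient updates" concrete: in the few-shot setting the demonstrations are simply placed in the text the model is asked to continue. The sketch below shows one way such a prompt could be assembled. The helper name `build_few_shot_prompt`, the `=>` separator, and the example pairs are illustrative assumptions, not the paper's actual prompt format.

```python
from typing import List, Tuple

def build_few_shot_prompt(task_description: str,
                          demonstrations: List[Tuple[str, str]],
                          query: str) -> str:
    """Concatenate K solved examples and one unsolved query into a single string.

    Nothing here touches model weights: the "learning" is whatever the model
    infers from the demonstrations while predicting the next tokens.
    """
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")   # left open for the model to complete
    return "\n".join(lines)

# Hypothetical word-unscrambling task with two demonstrations.
prompt = build_few_shot_prompt(
    "Unscramble the letters into a word.",
    [("pleap", "apple"), ("nabana", "banana")],
    "rgape",
)
print(prompt)
```

Passing an empty list of demonstrations gives the zero-shot case and a single pair gives the one-shot case; the model itself is identical in all three.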

With things like this, we will need to change how the media work, and how we read news. Every sentence, every factual statement will need to be verified by some kind of chain of trust involving entities with reputation, or be labeled as "fiction/opinion".

Check out the poetry it generated in Figure F.1 (especially poem 4). I don't know how many bad poems the authors had to sort through to find these, but this AI is generating real poetry. If I didn't know they were computer generated I doubt I would have even considered that they didn't come from a human. This is a huge accomplishment and the team that created GPT-3 should be proud.

GPT-2 would go out to lunch sometimes when generating poetry, but it would also create some pretty remarkable strings of words. This always stuck out to me from the work Gwern did:

'How the clouds Seem to me birds, birds in God’s garden! I dare not! The clouds are as a breath, the leaves are flakes of fire'

My home is surrounded by a fairly large variety of deciduous trees, and 'flakes of fire' is by far the best descriptor I've ever heard of their colors in the fall.

One of the other things I noticed about poetry (and songs) coming from these language models is they are amazing at bleakness. Just dark, dark, darker, fin. haha

It may have just seen that phrase before though. https://www.google.com/search?q=%22flakes+of+fire%22+leaves

If you do a Google search, there are plenty of texts with phrases like "leaves like flakes of fire".

They claim the completions are "uncurated" which I would assume means that they didn't sort through any of them, and these were just off the top. That seems pretty impressive, so I wouldn't be surprised if I was incorrect.

The paper indicates that those poems were uncurated.

Relatedly, here is a sample of uncurated and unconditional (all “topics”—not just poetry): https://raw.githubusercontent.com/openai/gpt-3/master/175b_s...

Scanning through these, the text seems significantly less zany than the random GPT-2 samples. It’s genuinely difficult to spot the signs that these were generated, even with the knowledge that they were.

To be fair, human poem authors generate and reject tons of bad poems too!

The difference is that the best authors know how to reject their own bad poems. I think the next step is a GAN that learns to weed out bad writing.

To be honest, I'm not sure that even the best authors reject their own poor work. "The complete works of X" for any poet X is frequently full of, well, duds. It's other people who extract the best of X's output and popularize it.

The models have not yet been released, and it looks like someone has already asked about it in the issues: https://github.com/openai/gpt-3/issues/1

I am sure it will be @ https://huggingface.co/models by tomorrow ;-)

How do you go about running a model this large?

They use extensive model parallelism when training. Even TPUs (64GB) or Tesla V100 GPUs (32GB) don’t have enough memory to fit a model onto a single chip, so you’ll need activation checkpointing or model parallelism.

Realistically you'd be able to train up to 760M param models. For that you'd need 32GB+ VRAM GPUs which I think AWS might have. You can try with 16GB VRAM GPUs, but you would need to figure out FP16.

https://github.com/shawwn is doing some work in the GPT-2 space including using TPUs instead -- which has given him pretty good results.

32GB RAM is nowhere near what's necessary. It can barely fit the GPT-2 models on there with a batch size in the single digits. We'll need extensive model parallelism libraries (Zero2 from Microsoft) to run this at all.

I would hazard a guess that they will release versions that will be a smaller size (they have in the past). But in order to run this, you'd just have to use a cloud provider, first guesses say it'll be 500GB+ of just weights that ideally you want in memory.
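A rough sketch of where a number of that size comes from, assuming the weights are stored densely at a fixed number of bytes per parameter. The 175 billion parameter count is from the paper's abstract; the storage precisions are assumptions.

```python
PARAMS = 175e9  # parameter count from the paper's abstract

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}  # assumed storage precisions

def weight_gigabytes(num_params: float, bytes_per_param: int) -> float:
    """Memory for the raw weights only (no activations, no optimizer state)."""
    return num_params * bytes_per_param / 1e9

for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gigabytes(PARAMS, size):.0f} GB")
```

That gives 700 GB at 32-bit and 350 GB at 16-bit precision, so the 500GB+ guess sits between the two. Either way the weights alone are far beyond a single 32GB card.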

Could we bank on the Lottery Ticket Hypothesis, distillation, or other model compression algorithms to make these models smaller?

I would guess so, but compressing it by a third of its size (i.e. distilgpt) would still leave it quite large. To be fair, I don't know if distillation scales like that.

Even the Teslas I use have 256GB memory and those are pretty cutting edge. 500GB in GPU memory is insane.


This post may be tongue in cheek, but realistically that's right. I ran the Facebook chatbot that was posted maybe a month ago and didn't get good performance until I was using many Tesla v100 GPUs, which are $8k each. Thankfully modern CSPs like Azure allow renting rigs that can handle this pretty easily.

How much does that cost?

The largest AWS instances are $30-something/hr on demand, less with reserved time. So pretty expensive.

21.3 USD/h in vast.ai (8X Tesla V100, 118.8 TFLOPS)

More or less the cost of a well trained, experienced customer service representative?

Yeah but the cost will come down.

that's only 256GB, which isn't enough. I'm not sure it's even possible to nvlink 16 v100s. I'd love to try it out for $40/hr if it were possible though.

10^4 petaflop-second/days.

They missed an opportunity to be the first paper to measure their computation in mole flops.

Rather, chemists constantly miss opportunities to use actual numbers instead of their lazy legacy mole nonsense.

Nobody really seems to use SI prefixes beyond peta or occasionally exa. But they could have called this 900 zetta-flop. (10^4 peta-flop/s-days)

But "mole flop" is one of the best units ever. It's better than furlongs per fortnight.

A petaflop/s-day is a nice unit because it's roughly what you get from running the fastest AI accelerators for a day. The V100 they used is 0.13 petaflop/s FP16, for example, and the recently announced A100 ups that to 0.3.
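The unit conversions in this subthread are easy to check. A small sketch, assuming the round figure of 10^4 petaflop/s-days used above (the constants are standard SI values):

```python
PETA = 1e15
ZETTA = 1e21
SECONDS_PER_DAY = 86_400
AVOGADRO = 6.02214076e23   # exact by SI definition

def pfs_days_to_flop(pfs_days: float) -> float:
    """One petaflop/s-day = 1e15 FLOP/s sustained for 86,400 seconds."""
    return pfs_days * PETA * SECONDS_PER_DAY

total = pfs_days_to_flop(1e4)
print(f"{total:.3e} FLOP")
print(f"{total / ZETTA:.0f} zettaFLOP")
print(f"{total / AVOGADRO:.2f} mole-FLOP")
```

So 10^4 petaflop/s-days is 8.64e23 floating-point operations: 864 zettaFLOP (hence "900" after rounding) or about 1.43 mole-FLOP.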

I really feel that these very large language models are able to see us in a way that we can't see ourselves. I'd be curious if someone can come up with psychological experiments that could be conducted against them in a way that helps us understand ourselves collectively (or commonly) rather than as the individual. Sort of like an egoless human essence.

Would be interesting to see if they can learn how animals communicate as well. Create a synthetic buddy for Buddy.

Collective consciousness trapped in a model is coming. I wouldn't call it AI, but an amalgamation of common human thought.

If you ever want a random friend to penpal with (or VC or whatever), HMU. Seriously. I would be lucky to hear your thoughts.

I'm kind of scared to see what GPT-10 will be capable of.

But I am really excited to get to play it, to test it out, and to try out my toolset to make sure it will do what I need it to do.

Source: https://talktotransformer.com/ Input: "I'm kind of scared to see what GPT-10 will be capable of."

Nice to see Jared Kaplan branching out into ML. He did fundamental work on CFTs/bootstrap in physics.

Is there a fast.ai like library that allows a novice to try GPT-3?

https://github.com/openai/gpt-3 only contains dataset

Exciting! It's been asked to be integrated into the Hugging Face transformers library already: https://github.com/huggingface/transformers/issues/4658

I wonder if the "Lottery ticket hypothesis" work can be applied to this model to further shrink the number of parameters by 10x, to bring it closer to Google's T5 but with higher accuracy?

They say there are no stupid questions, so here is mine:

If there are Billions of parameters in the SOTA models, how do we argue that they are not over fitting?

That's section 4 of OP.

Thank you. It is quite a labor to even skim through the 50+ page paper. Your poignant reply was quite helpful to draw my attention to the issue of contamination. After reading the section carefully, I think my understanding of over fitting is very much improved at least in so far as models like GPT-3 are concerned.

Clearly, the authors have given careful considerations to the issue of contamination and have provided reasonable analysis and a careful argument regarding over fitting the existing benchmarks.

On the other hand, I was wondering if the authors would like to consider purposefully creating a type of "out of sample data" for "creative evaluation"? Of course, GPT is no stranger to creativity, so it would be a fascinating challenge to come up with methods to create such datasets that are truly creative and challenge GPT-{N} to prove its mettle.

For example, would it be possible to engage a really good creative writer* along with a highly experienced school teacher to take on the Reading Comprehension task and create a few "tricky" evaluation samples that not only go above and beyond the contamination objections but also challenge the human intelligence to be careful not to fall into common traps?

This way lies a different evaluation metric - a subjective one perhaps, but it's a start. Just a thought experiment - that's all.

* so that they can come up with new ways to trick GPT/humans; a teacher knows the common mistakes the average student makes

Edit: Duh, my head immediately screamed GANs the moment I pressed submit, lol. But I am not sure if GANs make sense for NLP tasks. Like do they make sense if humans/domain experts try to solve them?

You might be interested in the ELECTRA model. It's the solid first success I've seen of a GAN-like framework in NLP. It also has links to why GANs still don't do so great in NLP in its references.

Thanks a lot.

If I may ask one more question, would you happen to know whether the authors or other researchers are doing any theoretical work on the experimental design and training methodologies of GPT/BERT? As in, why does it work? What is the significance of training via the "fill-in-the-blanks" method?

Don't get me wrong - the work is great and the SOTAs are amazing. I would just be happy to have a chat to discuss and bounce around some ideas about what all this means and why these methods seem to be working so well. Papers/articles/blog-posts are always a pleasure to read!

I think it's just kind of understood, so I don't have any real references for you. Filling in "A dog has ___ feet" requires actual facts. Or compare these two:

"The city councilmen refused the demonstrators a permit because they advocated violence. It wasn't the first time the _____ had advocated violence."

"The city councilmen refused the demonstrators a permit because they feared violence. It wasn't the first time the _____ had feared violence."

The syntax is identical. The words are identical, except that I swapped "advocated" out for "feared". When I swap it, the ____ changes from "demonstrators" to "councilmen." Think about what kinds of reasoning and experience and knowledge it takes you to resolve which group "they" refers to in this sentence.

Most blanks might be simpler and just correspond to learning english, like when the blank is "the," but learning that is a feat too. Filling in the blanks that require broader knowledge requires somehow capturing that broader knowledge.
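For anyone who wants to see what "fill-in-the-blanks" training data looks like mechanically, here is a toy sketch. It assumes whitespace tokenization and uses hypothetical helper names; real models use subword tokenizers. It contrasts hiding one word in the middle (the BERT-style setup) with predicting each next word from its prefix (the GPT-style setup). The word "four" is filled in for the dog example purely for illustration.

```python
from typing import List, Tuple

def masked_example(tokens: List[str], index: int,
                   blank: str = "____") -> Tuple[List[str], str]:
    """BERT-style: hide one token, keep context on both sides."""
    hidden = tokens[index]
    return tokens[:index] + [blank] + tokens[index + 1:], hidden

def next_word_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """GPT-style: every prefix is a context, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "A dog has four feet".split()

context, target = masked_example(sentence, 3)
print(" ".join(context), "->", target)

for prefix, nxt in next_word_examples(sentence):
    print(" ".join(prefix), "->", nxt)
```

In both setups the supervision is just the text itself: no labels beyond the hidden word are needed, which is why the method scales to web-sized corpora.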

Haven't read the paper, but it's still unclear how the mechanism of one shot learning works. If the weights are not being updated, how is it "learning"?

GPT-3 is obsoleted by orders of magnitude. SMIM has achieved a perplexity of 4.6 vs 20 for GPT-3, with a thousand times fewer parameters: https://arxiv.org/abs/2003.02645 This is the breakthrough of the year, and it will stay silent until the few nerds like me propagate the news to the mainstream.

Duplicate thread at https://news.ycombinator.com/item?id=23345449 (with github link)

Is it sentient yet? /s

Real question, are they going to release the full model?

It took them a while to release the full GPT-2 model because of the implications for things like spambots. The GPT-3 paper indicates that they have been monitoring forums and noticed that bad actors haven't really been using GPT-2 for their own devices. That's unsurprising because GPT-2 takes a lot of hardware to run and I assume it messes with the economics of spamming.

GPT-3 will take significantly more resources to run. However, part of me doesn't want it released ever because of the implications of what bad actors could do with it.

> That's unsurprising because GPT-2 takes a lot of hardware to run and I assume it messes with the economics of spamming.

GPT-2 doesn't require as many resources to run as you would expect: even from the 1.5B model, you can mass-produce passing spam comments for less than a dollar an hour in GPU costs: https://docs.aitextgen.io/tutorials/generate_1_5b/

Pure text spam in general is less effective in 2020; it's content that's harder to fake (e.g. deepfakes) that shakes up social media, which is why it's good FB/Twitter have proactively taken a stance against it.

I am confused. By content you mean audio/visual content in contrast to textual content?

What is this and why does it take the top two spots on HN?

One thread will (probably) be merged into the other, but GPT-2 was an extremely popular OpenAI project that generated long, realistic-sounding text/articles if you gave it a simple starting sentence or topic sentence. GPT-3 is an iteration on that, so it's likely a huge improvement.

It doesn't sound like it's an improvement at all, but instead requires less training data to produce worse results?

MUCH less training for SLIGHTLY worse results. It's a huge benefit to be able to make this trade-off.

Is the reverse also true? If you have the training data necessary for "good" results on GPT-2, is it generally correct to assume that it would provide better results on your task than GPT-3?

If you can answer this question without running both models over the data set, you've got a very good paper on your hands.

This is a massive improvement to the extent that previously you had to retrain (ie update) the stock model on a specialized dataset to get good results for a particular task.

GPT-2 was a groundbreaking advancement in NLP, this is an iteration on that. A general purpose language model that can answer questions, write full (mostly) human indistinguishable articles, do some translation, etc...
