How GPT3 Works – Visualizations and Animations (jalammar.github.io)
398 points by dsr12 11 months ago | hide | past | favorite | 98 comments

I enjoy a good visualization, but at best these are high-level graphical PowerPoints, and in this case I found the animations more distracting than useful.

Also, if you're going to do a 30k foot view of a technical topic, you might want to tell people what GPT3 is somewhere in there.

I agree that in this case the animated parts of the graphics were not needed, it's an easy pitfall to be distracted by the beautiful aspects of visualisations when crafting them.

I feel the need to defend the author though, it's hard to make research accessible while still distilling valuable insight. I think his post on transformer networks [1] did a good job for example, and you'll appreciate the lack of animations.

[1] https://jalammar.github.io/illustrated-transformer/

Yes this seems like an early work in progress, compared to Jay's previous Transformer articles.

In addition to your link, I've found a really good Transformer explanation here (backed by a Github repo w/ lively Issues talk): http://www.peterbloem.nl/blog/transformers

Additionally, there's a paper on visualizing self-attention: https://arxiv.org/pdf/1904.02679.pdf

Can't edit the post anymore so adding it here - further research reading on improving the current attention model: https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_b...

That's a good complement, thank you for the links

I feel this comment is overly negative. Just to provide a counter-datapoint: I have seen quite a bit of GPT-3 on HN lately but could not understand the research papers at all. They're too abstract, and I often fail to see what they really mean.

This article and the animations definitely helped me a lot in understanding this. I learned quite a few things, so thanks a lot to the author!

It explains a sequence-to-sequence model, which, granted, is a class of models that GPT-3 falls under.

But these animations/diagrams are so high level that they could be used for explaining all sorts of NLP models from the past 5 years.

Opening OP's page on a slow 4G connection (hotspotting from my smartphone), the whole page made no sense to me, because I couldn't tell whether I should wait for something to move or carry on.

My head was getting dizzy and I had to stop midway. People were smart enough to create the animations but not sensitive enough to know when they are too much.

Please don't use terms like "magic" when trying to explain things to people. They never point out where the "magic" part lines up to any of their other explanation.

Author here. Thank you. I feel an important element of this type of writing is what complexity to show and what to hide at different points. "Magic" is just to say "don't worry about the contents of this box yet, we'll get to it". It's what we discuss right after the visual. Sorry that came out as confusing. I'll add a note to the following figure saying that's the magic.

I get the sense that you're trying to mask the simplicity of predicting the next-most-likely word after training your app, a la Markov chains, under the guise of "magical AI." Providing an error threshold when it spits out the wrong response to a phrase seems to be worsening its natural ability as well.

As physicists, we love these kinds of magic introduced by mathematicians.

A lot of maths is basically just "number magic": apply the formula, get the desired output.

That's how it's often taught, which is a real shame. Paul Lockhart wrote an elegant piece about this, titled A Mathematician's Lament [0]:

> Nevertheless, the fact is that there is nothing as dreamy and poetic, nothing as radical, subversive, and psychedelic, as mathematics. It is every bit as mind blowing as cosmology or physics (mathematicians conceived of black holes long before astronomers actually found any), and allows more freedom of expression than poetry, art, or music (which depend heavily on properties of the physical universe). Mathematics is the purest of the arts, as well as the most misunderstood.

> ...

> This is why it is so heartbreaking to see what is being done to mathematics in school. This rich and fascinating adventure of the imagination has been reduced to a sterile set of “facts” to be memorized and procedures to be followed. In place of a simple and natural question about shapes, and a creative and rewarding process of invention and discovery, students are treated to this: Triangle Area Formula - A = 1/2 b h

> “The area of a triangle is equal to one-half its base times its height.” Students are asked to memorize this formula and then “apply” it over and over in the “exercises.” Gone is the thrill, the joy, even the pain and frustration of the creative act. There is not even a problem anymore. The question has been asked and answered at the same time — there is nothing left for the student to do.

[0]: https://www.maa.org/external_archive/devlin/LockhartsLament....

That is absolutely not "magic!"

Depends how good you are at reading maths proofs. It's like how hardware is magical to some of us because we're too spooked to grab a soldering iron.

I love the amount of effort Jay puts into his posts to develop intuition. And I wonder if there are some open source projects out there to help researchers who like to blog make simple animations.

I'm curious, what can I, as a full-stack developer, do to prepare for things like GPT-X eventually making a lot of the work I do obsolete in the next 10-20 years?

Seeing all these demonstrations is starting to make me a little bit nervous and I feel it is time for a long term plan.

The parts of programming that are going to get automated are going to be the parts that require little skill, take a long time, and are boring as hell: writing boilerplate CRUD code, wiring up buttons to actions, etc.

Automating the harder and more interesting parts of programming is many orders of magnitude more difficult. This requires a true understanding of the problem domain and the ability to "think." GPT-3 and similar are just really good prediction engines that can extrapolate based on training data of what's already been done.

The answer therefore is the same as "how do I stay competitive vs. lower skill offshore labor?" You need to level up and become skilled in higher-order thinking and problem solving, not just grinding out glue code and grunt work.

Ruby on Rails scaffolding didn't make backend developers obsolete. I know you said GPT-X, but GPT-3 is at the boundaries of technology. The jump to GPT-4 will either take much longer or be much less impressive than the jump from GPT-2 to GPT-3. I would say that your job is safe from automation from GPT. But the technology that might put you out of job, which I personally think will not be something like a neural network, might be spontaneously discovered in the next 10-20 years just like the spontaneity of smart phones. To answer your question, be a human; be adaptable, be useful.

Why don't you think GPT-4 will be as impressive?

I'm actually looking forward to more code generation tools. Things like wiring up a button aren't stimulating and I wouldn't mind that level of programming becoming automated.

That’s what I loved about Visual Basic. You could just draw your user interface and specify actions and then just fill in the one or two lines of code that need to run when that button is pressed.

I’m surprised React doesn’t have something like that. At least not that I’m aware of. Is there a GUI interface builder for React?

There are a handful of projects out there, such as BuilderX[0] and React Studio[1].

[0] https://builderx.io/ [1] https://reactstudio.com/

I am as well, especially ever since I saw [1]. It's a small test that someone tried with GPT-3 that translates natural language descriptions and phrases into shell commands.

Some of the examples from the tweets:

> Q: find occurrences of the string "pepsi" in every file in the current directory recursively

> A: grep -r "pepsi"*

> Q: run prettier against every file in this directory recursively, rewriting the files

> A: prettier --write "*.js"

It also seems to work the other way: you can give it a shell command and have it write a plain English description of it.

Granted, sometimes the results are wrong; in a video I saw of someone playing with it, about 1 in 10 commands were subtly wrong or lacked enough context to generate what you really meant. But as a starting point it seems like such a powerful tool!

I personally spend a lot of time looking up shell command flags, thinking of ways to combine tools to get the data I want out of a log or something, or running help commands to figure out the kubectl incantation that will just let me force a deployment to redeploy with the latest image.

Imagine having a VS Code-style command palette where I can just type a plain-English description of what I'm trying to do and have it generate a command that I can tweak or just run, turning a 10-minute process of recalling esoteric flags or finding documentation into 10 seconds of typing.

If it's really as good as it seems, imagine being able to type stuff like "setup test scaffolding for the LoginPage component" and having it just generate a "close enough" starting point!

[1] https://twitter.com/harlandduman/status/1282132804034150400

On the one hand, yes, you should prepare.

On the other hand, when it’s good enough to replace us, it’s also good enough to replace basically any job where you transform a written request into some written output, e.g. law, politics, pharmacology, hedge fund management, and writing books.

I have no idea how to prepare, only that I should.

(Edit: what makes us redundant may well not be in the GPT family, but I do expect some form of AGI to be good-enough in 20 years).

There’s a good book called “Rebooting AI” that does some fundamental analysis about current state of deep learning and its applications.

The biggest problem with GPT or any massive neural net is explainability. When it doesn’t do the correct thing, no one quite knows why. GPT makes all sorts of silly mistakes.

The human brain, albeit a form of neural net, can do some very deep symbolic reasoning about things. Artificial neural nets just don't do that (yet). We haven't figured that out, nor have I seen a system that comes close. We haven't got generic neural nets that can perform arithmetic operations to arbitrary precision. For computers to learn language properly, they would have to embed themselves in the world for years, like children do, and learn the relationships of objects in the world.

So if I were running a fake-comment farm, I'd worry about GPT. Not so much if I were a programmer or a lawyer. We do some very deep symbolic thinking to produce our work. If computers are able to replace us, they can probably replace a large part of humanity, at which point we have way bigger problems to worry about.

Symbolic reasoning is a very hard problem to crack. Something like “how old was Obama’s 2nd child when the US hit 4 digit deaths due to covid-19?”. Answering that question not only requires context like “4 digit” means 1000, it requires a bunch of lookups and ability to break a big problem into smaller problems.

Siri/Ok Google/Alexa/Cortana/GPT3 - all of them fail.

They can't even answer "Find fast food restaurants around me that aren't McDonald's".

Great comment with fantastic context and examples!

This piece by minimaxir is a nice (less bloated) article about GPT-3 that can help you see a few more things, like the limitations: https://minimaxir.com/2020/07/gpt3-expectations/

HN discussion about that post: https://news.ycombinator.com/item?id=23891226

Get good at specifying and documenting product requirements apparently.

Also remember that ultimately even if GPT-X is successful at transforming text into working code, all that's done is essentially define a new programming language. Instead of writing Python, you'll write GPT-X-code at a higher level.

Despite sharing some similarities, there’s a huge difference between those two languages.

Well, if it's trained on GitHub samples, make sure you only feed it horribly bad code.

There are other GPT models trained on Github. It can write functions from comments describing the requirements.

GPT-x will be able to perform most copy-and-paste operations soon enough, so those are the kinds of jobs that will be made obsolete by it. Low-code and point-and-click jobs are the ones that will follow. At first it will be "aiding" developers by suggesting code, and then GPT's successors will finally deliver the "no code, only a business description" promise that has been hanging over the industry for decades.

Of course GPT-3 is not there yet, but it's only a matter of time: the capabilities are there. You are already thinking in decades, which is the right mindset. Fortunately, tech is not something that will ever be finished, so there will always be opportunities, just not in the fields we are looking at this time: digital products like web or mobile apps will be as exciting as a custom invoicing Windows app in a matter of years, but then you have IoT, autonomous vehicles, blockchain, and whatnot. Stay ahead of the curve as an engineer.

Of course you can also move up the food chain and become a manager or technical architect or lead.

Managing people is something that I don't feel any AI would be able to do in the foreseeable future.

But then I'm almost in my fifties, so I'm only looking at three more decades in the best case.

Being worried about new potentially disruptive tech is legitimate, it's hard to see our place in an environment we can't predict.

However, particularly as a full-stack dev, I think it will create more job opportunities than competition. You mention 10-20 years ahead; if you look the same horizon back into the past, it seems (I wasn't working then) that the job also changed significantly without making devs obsolete.

AGI might happen in our lifetime (I hope so), but I'm dubious that it will happen through a singularity [1]. Therefore, I'm not worried that as tech experts we won't have time to adapt.

[1] this blog post by Robin Hanson is from 2014, but recent research events especially from OpenAI have only reinforced his points https://www.overcomingbias.com/2014/07/30855.html

I wouldn't worry about it. If we can automate programmers, we can also automate most other jobs.

Your main value add as a developer is understanding the problem domain. Machines won't be able to do this in your, or your children's lifetime, outside some important, but very constrained niches.

I think that Computer Science proper will retain more value than webdev & "full stack" development.

That said I really think that it's overblown for now.

CASE tools have been evolving for 50+ years and they still haven't conquered the software development world. You'll be fine.


Do the same thing as you did when cloud services took away all the ops jobs.

Hedge. Save a larger percentage of your income & invest it across companies that are most likely to profit off your job being automated.

This type of model can only generate mediocre output. It is trained on a large corpus of text from around the internet, with little quality control.

I don't feel nervous at all. What am I missing?

There have been demos of it writing code from English comments.


It's not clear how much the demos have been gamed for presentation, and it seems more of an opportunity than a threat - it will still need devs to put stuff together, and (assuming it is as impressive as demoed) will take a lot of the donkey work away.

I would need to see a lot more than that to be impressed/worried. I love how there's even a serious bug in there - discounts at 80% instead of 20%

A significant chunk of what devs are paid for is the donkey work. Making every dev significantly more productive increases the supply of dev power relative to the demand, dropping the price.

Many future opportunities can't be predicted, but having savings is a good hedge for not being able to work for just about any reason.

Why not learn some deep learning and participate in the coming automation? You are in a better position than most, given it is just code.

ML is the most automatable domain of programming there is. I never understood this argument.

It may be that deep learning as we know it (TF, PyTorch) is going to be replaced by prompting large models, thus making most applications straightforward for anyone to use.

By the time we can prompt large models and it gives us everything we want, then our work is clearly done. Before then, there's still work left.

Vote for policies that support humans.

It's not going to happen.

What am I missing? How is any of his visualizations GPT-3 specific and not, say, a deep learning LSTM from years ago?

AFAIK, the only thing new about GPT-3 is its massive size, the architecture is completely conventional, so the same as those you've seen from a few years ago.

The visualizations seem to show non-recurrent networks, whereas my understanding is that one of the important differences between GPT-1 and GPT-2/3 is the use of recurrent networks.

This allows the output to loop backwards, providing a rudimentary form of memory / context beyond just the input vector.

While models such as XLNet incorporate recurrence, GPT-{2,3} is mostly just a plain decoder-only transformer model.[1]

[1] https://arxiv.org/abs/2005.14165 [2] https://d4mucfpksywv.cloudfront.net/better-language-models/l...

Just curious. What languages (human languages) were used in the training data set of GPT3? Is it trained only on English texts and grammar, or is it transcending language barriers?

The vast majority (>93%) is English (by document): https://github.com/openai/gpt-3/blob/master/dataset_statisti...

My (very limited) understanding of AI models is the input "shape" has to be well defined.

I.e. a vision network expects 1 input per pixel (or more for encoding color) and so it's up to you to "format" your given image into what the model expects.

But what about GPT-3, which takes in "free text"? The animations in the post show 2048 input nodes; does this mean it can only take in a maximum of 2048 tokens, or will it somehow scale beyond that?

Correct, you can only input up to 2048 tokens total (this is a big improvement over GPT-2's 1024 input size). You can use sliding windows to continue generating beyond that.

However, model training scales quadratically as input size increases, which makes building larger models more difficult (which is why Reformer is trying workarounds to increase the input size).
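The sliding-window idea can be sketched in a few lines of plain Python; `model_step` here is a hypothetical stand-in for a real model call (not any actual API), and the window size mirrors GPT-3's 2048-token limit:

```python
def generate_with_sliding_window(model_step, prompt_tokens, n_new, max_context=2048):
    """Generate n_new tokens, truncating context to the last max_context tokens.

    model_step is any callable mapping a token list to the next token;
    it stands in for a real model call here.
    """
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        context = tokens[-max_context:]  # drop the oldest tokens beyond the window
        tokens.append(model_step(context))
    return tokens
```

The model never sees anything older than the window, which is why long-range coherence degrades when generating far beyond the context size.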

Yes, there is a limited amount of input. In addition, each token may be a word or only part of a word, depending on how common it is. Common words get one token and uncommon words are divided into pieces, each of which gets a token.
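As a rough illustration of that splitting, here is a toy greedy longest-match subword tokenizer. Real BPE learns merge rules from corpus statistics rather than doing longest-match over a hand-written vocabulary, so this is only a sketch of the behavior, not the actual algorithm:

```python
def greedy_tokenize(word, vocab):
    """Toy greedy longest-match subword tokenizer (illustrative only).

    Common words present in vocab come out as one token; unknown words
    fall apart into known pieces or single characters.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single chars are always allowed
                tokens.append(piece)
                i = j
                break
    return tokens
```

With a vocabulary like `{"the", "token", "ization"}` (a made-up example), "the" stays whole while "tokenization" splits into two pieces.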

Is GPT-3 even a computer vision AI model?

No, but there ain't no rule about flattening pixels and using it as input data. https://news.ycombinator.com/item?id=23554944
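A minimal sketch of that idea: a sequence model with a fixed input size doesn't care where the numbers came from, so a grayscale image can be flattened into a sequence, as in OpenAI's Image GPT experiments (the 28x28 size here is just an illustrative choice):

```python
import numpy as np

# A 28x28 grayscale "image" flattened row by row into a 784-long sequence,
# which could then be fed to a sequence model like any other token stream.
image = np.zeros((28, 28), dtype=np.uint8)
sequence = image.reshape(-1)  # shape (784,)
```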

So can something like this answer a query like "give me the names of countries with population exceeding a million" How would it go about doing that?

The answer is yes and no. First, it will produce an output for any input. What you really mean is whether it can answer a query correctly.

It goes about doing that the same way it works in general, which is memorizing sequences that are similar and outputting the corresponding sequence that follows. For example, if the training data has something like "These are the countries that have over a million people: <countries>", I would not be surprised if it returned <countries> for your query. However, if your query were "less than a million", I would be very surprised if it returned the other countries.

Just tested the "less than a million" prompt: it works, kinda.


Why did it output so many answers? Also it missed Iceland.

How are you able to use it? Can I try it out?

Each answer is an independent generation from the model.

I have access to the API.

Is it hard to get access?

It depends on how you construct the prompt.

If you only give it "Here are the countries whose population exceeds 1 million:", the model has a chance to go off on a tangent, produce inconsistently structured output, or produce inconsistent values (examples when generating at temp=0.7: https://gist.github.com/minimaxir/86e09253f9e05058eb1e96de2b... )

If you give it the same prompt with a double line break and a "1.", it behaves much better.
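For illustration, the two prompt variants differ only in how the string ends; these are just the literal strings, with no API call involved:

```python
# Loose prompt: the model may wander off or structure its output inconsistently.
loose_prompt = "Here are the countries whose population exceeds 1 million:"

# Ending with a double line break and a list marker constrains the
# continuation to a numbered list, which behaves much better in practice.
structured_prompt = loose_prompt + "\n\n1."
```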

Can someone explain this - what is meant by "complete"?

> Training is the process of exposing the model to lots of text. It has been done once and complete.

I think this is a typo. Perhaps the author means to say that the training process has been completed?

Yes, from now on we can only control it by prompting. But OpenAI announced plans for further fine-tuning on client data.

This is fun, but IMO too simplified. For example, it's really important to know that GPT-3 does not see "words"; it sees byte-pair encodings, which are for the most part smaller than words but larger than individual characters. This has immediate implications for what GPT-3 can and cannot do. It can reverse a sentence (This cat is cute -> cute is cat this) but it cannot reliably reverse a word (allegorical -> lacirogella).
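The asymmetry is easy to see in plain Python: sentence reversal only shuffles whole tokens, while word reversal needs the character-level access that BPE hides from the model.

```python
# Reversing a sentence only requires rearranging whole tokens,
# which the model can see:
sentence = "This cat is cute"
reversed_words = " ".join(reversed(sentence.split()))  # "cute is cat This"

# Reversing a word requires character-level access, which BPE hides:
word = "allegorical"
reversed_chars = word[::-1]  # "lacirogella"
```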


Interesting to consider whether this limitation of BPE points to a more fundamental issue with the model. Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

If so, wouldn't this be evidence that the model is using its mind-blowingly large latent space to memorize surface patterns that bear no real relationship to the underlying language (as most people suspect)?

I suppose this comes back to my question about Transformer models in general - the use of a very large attention window of BPE tokens.

When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read. So I doubt our brains are keeping some running stack of the last XXXX words, or even some smaller distributed representation thereof.

It's more plausible that we're using some kind of natural hierarchical compression/comprehension mechanism operating on the character/word/sentence/paragraph level.

It certainly feels like GPT-3 is using a huge parameter space to bypass this mechanism and simply learn a "reconstitutable" representation.

Either way, I'd be really interested to see how it handles character-level input symbols.

> Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?

The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
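To see where the quadratic cost comes from, here is a minimal single-head self-attention sketch in NumPy, with no learned projections (an illustrative simplification, not GPT-3's actual implementation): the score matrix is n x n in the number of input tokens, so halving the "size" of each symbol multiplies the number of symbols, and the cost grows with its square.

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention sketch (no learned Q/K/V projections).

    x: (n, d) array of token vectors. The score matrix is (n, n), which is
    where the quadratic cost in sequence length n comes from.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                       # (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ x                                  # (n, d) mixed vectors
```

If identical token vectors go in, every row attends uniformly and the output equals the input, which makes the sketch easy to sanity-check.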

> When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read.

Sure you could: you could look up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)

> the use of a very large attention window of BPE

You're also able to remember chunks from not long before. You just don't remember all of them. I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions. (E.g. you can't just bolt a nearest-key->value database on the side and simply expect it to learn to use it).

> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.

That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

> Sure you could, you could look up and copy it which is an ability GPT-3 also needs to model if its to successfully learn from the internet where people do that all the time. :)

That's right - but then we're just talking about memorization and regurgitation. Sure, it's impressive when done on a large scale, but is it really a research direction worth throwing millions of dollars at?

> I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions.

Of course, but all of my whinging about Transformers is a roundabout way of saying "I'm not convinced that the One True AI will unquestionably use some variant of differentiation/backpropagation".

> That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?

BPEs aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)

Hey how did you do those animations?

Apple Keynote

Could I give gpt-3 a legal contract or a terms of service and then ask it questions about it?

Yes, GPT-2 could also do that. It generally works best if you give it some examples to start off. For example, I actually ran the following prompt through the full GPT-2. Everything after "How long ago did Elasmosaurus live?" is GPT-2 talking.

Elasmosaurus is a genus of plesiosaur that lived in North America during the Campanian stage of the Late Cretaceous period, about 80.5 million years ago. The first specimen was discovered in 1867 near Fort Wallace, Kansas, US, and was sent to the American paleontologist Edward Drinker Cope, who named it E. platyurus in 1868. The generic name means "thin-plate reptile", and the specific name means "flat-tailed". Cope originally reconstructed the skeleton of Elasmosaurus with the skull at the end of the tail, an error which was made light of by the paleontologist Othniel Charles Marsh, and became part of their "Bone Wars" rivalry. Only one incomplete Elasmosaurus skeleton is definitely known, consisting of a fragmentary skull, the spine, and the pectoral and pelvic girdles, and a single species is recognized today; other species are now considered invalid or have been moved to other genera.

Where did the Elasomosaurus live? North America

Where was the first Elasomosaurus discovered? Fort Wallace, Kansas

How long ago did Elasmosaurus live? 80.5 million years ago

When was Elasomosaurus discovered? 1867

Was Elasmosaurus capable of leaping? Yes, the two small, sharp teeth on either side of the lower jaw contain the necessary enzymes to propel the animal upwards by using muscles. However, in the developing skeleton, the upper and lower jaws had a tendency to grip the body,

I don't know why everyone else is saying yes. No, in practicality you can't. The window is way too short for most contracts, let alone TOS agreements.

Yes, and it will give you answers: each answer will be a blend of answers to similar questions asked before.

But say I give it a new piece of text it has never seen before. Can it answer questions about that, or won't it really care about what's in that text?

It will give an answer, whatever is formed by extending the input you give it. The answer will be based on the text you provide, so in that sense it "cares" about it. Whether the answer is any good is another matter. But maybe it will find something based on its training data that relates.

Or, a body of scientific literature, but how do you check the answers?

Looks like a big rainbow table generated by AI. Thanks Jay for creating this amazing chart.

It has such a confusing name.. I always think this is something interesting, and then it's just more ML bunk.

I find it funny that the first two comments have to do with how they did the _animations_.

Yeah, mainly because it's generally getting more difficult to distill ideas and keep people interested. Highly applicable to communications in engineering.

"OpenAI" isn't "Open" therefore it's worthless. Next.
