Also, if you're going to do a 30k foot view of a technical topic, you might want to tell people what GPT3 is somewhere in there.
I feel the need to defend the author though; it's hard to make research accessible while still distilling valuable insight. I think his post on transformer networks did a good job, for example, and you'll appreciate the lack of animations.
In addition to your link, I've found a really good Transformer explanation here (backed by a GitHub repo with lively discussion in the Issues): http://www.peterbloem.nl/blog/transformers
Additionally, there's a paper on visualizing self-attention: https://arxiv.org/pdf/1904.02679.pdf
This article and the animations definitely helped me a lot in understanding this. I learned quite a few things, so thanks a lot to the author!
But these animations/diagrams are so high level that they could be used for explaining all sorts of NLP models from the past 5 years.
> Nevertheless, the fact is that there is nothing as dreamy and poetic, nothing as radical, subversive, and psychedelic, as mathematics. It is every bit as mind blowing as cosmology or physics (mathematicians conceived of black holes long before astronomers actually found any), and allows more freedom of expression than poetry, art, or music (which depend heavily on properties of the physical universe). Mathematics is the purest of the arts, as well as the most misunderstood.
> This is why it is so heartbreaking to see what is being done to mathematics in school. This rich and fascinating adventure of the imagination has been reduced to a sterile set of “facts” to be memorized and procedures to be followed. In place of a simple and natural question about shapes, and a creative and rewarding process of invention and discovery, students are treated to this: Triangle Area Formula - A = 1/2 b h
> “The area of a triangle is equal to one-half its base times its height.” Students are asked to memorize this formula and then “apply” it over and over in the “exercises.” Gone is the thrill, the joy, even the pain and frustration of the creative act. There is not even a problem anymore. The question has been asked and answered at the same time — there is nothing left for the student to do.
Seeing all these demonstrations is starting to make me a little bit nervous and I feel it is time for a long term plan.
Automating the harder and more interesting parts of programming is many orders of magnitude more difficult. This requires a true understanding of the problem domain and the ability to "think." GPT-3 and similar are just really good prediction engines that can extrapolate based on training data of what's already been done.
The answer therefore is the same as "how do I stay competitive vs. lower skill offshore labor?" You need to level up and become skilled in higher-order thinking and problem solving, not just grinding out glue code and grunt work.
I’m surprised React doesn’t have something like that. At least not that I’m aware of. Is there a GUI interface builder for React?
Some of the examples from the tweets:
> Q: find occurrences of the string "pepsi" in every file in the current directory recursively
> A: grep -r "pepsi" *
> Q: run prettier against every file in this directory recursively, rewriting the files
> A: prettier --write "*.js"
It also seems to work in the other direction: you can give it a shell command and have it write a plain-English description of it.
Granted, the results are sometimes wrong. In a video I saw of someone playing with it, roughly 1 in 10 of the commands were subtly wrong or lacked enough context to generate what you really meant (the prettier answer above, for instance, only matches top-level files; true recursion needs a glob like "**/*.js"). But as a starting point it seems like such a powerful tool!
I personally spend a lot of time looking up shell command flags, thinking of ways to combine tools to get the data I want out of a log or something, or running help commands to figure out the kubectl incantation that will just let me force a deployment to redeploy with the latest image.
Imagine having a VS Code-style command palette where I can just type a plain-English description of what I'm trying to do, and have it generate a command that I can tweak or just run. Turning a 10 minute process of recalling esoteric flags or finding documentation into 10 seconds of typing.
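A minimal sketch of what that palette lookup might feel like. To be clear, this is purely illustrative: a real tool would call a model like GPT-3, whereas here the description-to-command mapping is hardcoded. The kubectl command shown is the standard way to force a redeploy (available since kubectl 1.15).

```python
# Illustrative sketch only: a real "plain English -> shell" palette would
# call a model; the mapping here is hardcoded just to show the intended UX.
PALETTE = {
    "force a deployment to redeploy with the latest image":
        "kubectl rollout restart deployment/<name>",
    "find occurrences of pepsi in every file recursively":
        'grep -r "pepsi" .',
}

def suggest(description: str) -> str:
    """Return a candidate command for a plain-English description."""
    return PALETTE.get(description.lower(), "# no suggestion")

print(suggest("Force a deployment to redeploy with the latest image"))
```

The appeal is exactly the workflow described above: the tool proposes a command, and you tweak or run it.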
If it's really as good as it seems, imagine being able to type stuff like "setup test scaffolding for the LoginPage component" and having it just generate a "close enough" starting point!
On the other hand, when it’s good enough to replace us, it’s also good enough to replace basically any job where you transform a written request into some written output, e.g. law, politics, pharmacology, hedge fund management, and writing books.
I have no idea how to prepare, only that I should.
(Edit: what makes us redundant may well not be in the GPT family, but I do expect some form of AGI to be good-enough in 20 years).
The biggest problem with GPT or any massive neural net is explainability. When it doesn’t do the correct thing, no one quite knows why. GPT makes all sorts of silly mistakes.
The human brain, albeit a form of neural net, can do some very deep symbolic reasoning about things. Artificial neural nets just don't do that (yet). We haven't figured that out, nor have I seen a system that comes close. We don't even have generic neural nets that can perform arithmetic operations to arbitrary precision. For computers to learn language properly, they may have to embed themselves in the world for years, like children do, and learn the relationships between objects in the world.
So if I were running a fake-comment farm, I'd worry about GPT. Not so much if I were a programmer or a lawyer: we do some very deep symbolic thinking to produce our work. If computers are able to replace us, they can probably replace a large part of humanity, at which point we have far bigger problems to worry about.
Symbolic reasoning is a very hard problem to crack. Consider something like “how old was Obama’s 2nd child when the US hit 4 digit deaths due to covid-19?”. Answering that not only requires context, like knowing that “4 digit” means at least 1,000; it requires a bunch of lookups and the ability to break a big problem into smaller ones.
Siri/Ok Google/Alexa/Cortana/GPT3 - all of them fail.
They can’t even answer “Find fast food restaurants around me that aren’t mcdonalds”.
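To make the multi-hop decomposition concrete, here is a hand-written sketch. This is not anything GPT-3 does: the two facts (Sasha Obama's birth date, the approximate date the US passed 1,000 COVID-19 deaths) are hardcoded assumptions, and the point is only the structure of breaking one question into smaller lookups.

```python
from datetime import date

# Hand-coded "knowledge base" standing in for the lookups a real system
# would need; dates are approximate and for illustration only.
FACTS = {
    "obama_second_child_birth": date(2001, 6, 10),   # Sasha Obama
    "us_1000th_covid_death":    date(2020, 3, 26),   # "4 digit deaths"
}

def age_on(birth: date, when: date) -> int:
    """Whole years between `birth` and `when`."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)

# Decomposition: who is the 2nd child? -> when was she born? ->
# when did the US hit 4-digit deaths? -> subtract the dates.
answer = age_on(FACTS["obama_second_child_birth"],
                FACTS["us_1000th_covid_death"])
print(answer)  # 18
```

Each step is trivial on its own; the hard part, which current assistants fail at, is producing this decomposition from the natural-language question.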
Also remember that ultimately even if GPT-X is successful at transforming text into working code, all that's done is essentially define a new programming language. Instead of writing Python, you'll write GPT-X-code at a higher level.
Of course GPT-3 is not there yet, but it's only a matter of time: the capabilities are there. You are already thinking in decades, which is the right mindset. Fortunately, tech will never be "done", so there will always be opportunities, just not in the fields we are looking at right now. Digital products like web and mobile apps will be about as exciting as a custom invoicing Windows app within a few years, but then you have IoT, autonomous vehicles, blockchain, and whatnot. Stay ahead of the ball as an engineer.
Of course you can also move up the food chain and become a manager or technical architect or lead.
But then I'm almost in my fifties, so I'm only looking at three more decades in the best case.
However, particularly as a full-stack dev, I think it will create more job opportunities than competition. You mention 10-20 years ahead; if you look at that same horizon back in the past, it seems (I wasn't working then) that the job also changed significantly, without making devs obsolete.
AGI might happen in our lifetime (I hope so), but I'm dubious that it will happen through a singularity. Therefore, I'm not worried that as tech experts we won't have time to adapt.
This blog post by Robin Hanson is from 2014, but recent research events, especially from OpenAI, have only reinforced his points: https://www.overcomingbias.com/2014/07/30855.html
That said, I really think it's overblown for now.
It's not clear how much the demos have been gamed for presentation, and it seems more of an opportunity than a threat - it will still need devs to put stuff together, and (assuming it is as impressive as demoed) will take a lot of the donkey work away.
This allows the output to loop backwards, providing a rudimentary form of memory / context beyond just the input vector.
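That feedback loop can be shown in a few lines. This is a toy scalar recurrent step with made-up weights, not a real RNN cell: the hidden state from the previous step is fed back in alongside the new input, which is the "memory" being described.

```python
import math

# Toy recurrent step with made-up scalar weights. The hidden state h is
# fed back into the next step; that feedback is the rudimentary memory.
W_IN, W_REC = 0.5, 0.8

def rnn_step(x: float, h: float) -> float:
    return math.tanh(W_IN * x + W_REC * h)

h = 0.0
for x in [1.0, 0.0, 0.0]:   # one pulse of input, then silence
    h = rnn_step(x, h)
    print(round(h, 3))
# The pulse keeps echoing through h even after the input drops to zero.
```

Without the `W_REC * h` term, the output at each step would depend only on the current input, with no context at all.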
E.g., a vision network expects one input per pixel (or more, to encode color), so it's up to you to "format" your image into what the model expects.
But what about GPT-3, which takes in "free text"? The animations in the post show 2048 input nodes; does this mean it can only take in a maximum of 2048 tokens, or will it somehow scale beyond that?
However, model training cost scales quadratically with input size, which makes building larger models more difficult (which is why Reformer explores workarounds to increase the input size).
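A back-of-envelope illustration of that quadratic growth: full self-attention compares every token with every other token, so doubling the context window quadruples the attention work.

```python
def attention_pairs(n_tokens: int) -> int:
    # Full self-attention computes an interaction for every (token, token)
    # pair, hence n^2 work in the sequence length.
    return n_tokens * n_tokens

print(attention_pairs(2048))   # 4_194_304 pairwise interactions
print(attention_pairs(4096))   # 16_777_216
print(attention_pairs(4096) // attention_pairs(2048))  # 4x the work
```

This is why techniques like Reformer's locality-sensitive-hashing attention aim to replace the full n-squared comparison with something cheaper.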
It goes about it the same way it works in general: by memorizing sequences that are similar and outputting the corresponding sequence that follows. For example, if the training data has something like "These are the countries that have over a million people: <countries>", I would not be surprised if it returned <countries> for your query. However, if your query was "less than a million", I would be very surprised if it returned the other countries.
How are you able to use it? Can I try it out?
I have access to the API.
If you only give it "Here are the countries whose population exceeds 1 million:", the model has a chance to go off on a tangent, produce inconsistently structured output, or give inconsistent values (examples when generating at temp=0.7: https://gist.github.com/minimaxir/86e09253f9e05058eb1e96de2b... )
If you give it the same prompt with a double line break and a "1.", it behaves much better.
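A minimal sketch of the two framings being compared (the prompt text is from the comment above; the list-seeding trick is what is being described):

```python
base = "Here are the countries whose population exceeds 1 million:"

# Loose prompt: the model is free to continue in any format.
loose = base

# Seeded prompt: a blank line plus a numbered-list opener nudges the
# model toward a consistent enumerated format.
seeded = base + "\n\n1."

print(repr(seeded))
```

The model only ever sees a string; small structural cues like this are doing a lot of the work in the impressive demos.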
> Training is the process of exposing the model to lots of text. It has been done once and complete.
If so, wouldn't this be evidence that the model is using its mind-blowingly large latent space to memorize surface patterns that bear no real relationship to the underlying language (as most people suspect)?
I suppose this comes back to my question about Transformer models in general - the use of a very large attention window of BPE tokens.
When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read. So I doubt our brains are keeping some running stack of the last XXXX words, or even some smaller distributed representation thereof.
It's more plausible that we're using some kind of natural hierarchical compression/comprehension mechanism operating at the character/word/sentence/paragraph level.
It certainly feels like GPT-3 is using a huge parameter space to bypass this mechanism and simply learn a "reconstitutable" representation.
Either way, I'd be really interested to see how it handles character-level input symbols.
The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
> When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read.
Sure you could: you could look it up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)
> the use of a very large attention window of BPE
You're also able to remember chunks from not long before. You just don't remember all of them. I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions. (E.g. you can't just bolt a nearest-key->value database on the side and simply expect it to learn to use it).
That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?
> Sure you could, you could look up and copy it which is an ability GPT-3 also needs to model if its to successfully learn from the internet where people do that all the time. :)
That's right - but then we're just talking about memorization and regurgitation. Sure, it's impressive when done on a large scale, but is it really a research direction worth throwing millions of dollars at?
> I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions.
Of course, but all of my whinging about Transformers is a roundabout way of saying "I'm not convinced that the One True AI will unquestionably use some variant of differentiation/backpropagation".
> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
BPEs aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
Elasmosaurus is a genus of plesiosaur that lived in North America during the Campanian stage of the Late Cretaceous period, about 80.5 million years ago. The first specimen was discovered in 1867 near Fort Wallace, Kansas, US, and was sent to the American paleontologist Edward Drinker Cope, who named it E. platyurus in 1868. The generic name means "thin-plate reptile", and the specific name means "flat-tailed". Cope originally reconstructed the skeleton of Elasmosaurus with the skull at the end of the tail, an error which was made light of by the paleontologist Othniel Charles Marsh, and became part of their "Bone Wars" rivalry. Only one incomplete Elasmosaurus skeleton is definitely known, consisting of a fragmentary skull, the spine, and the pectoral and pelvic girdles, and a single species is recognized today; other species are now considered invalid or have been moved to other genera.
Where did the Elasmosaurus live?
Where was the first Elasmosaurus discovered?
Fort Wallace, Kansas
How long ago did Elasmosaurus live?
80.5 million years ago
When was Elasmosaurus discovered?
Was Elasmosaurus capable of leaping?
Yes, the two small, sharp teeth on either side of the lower jaw contain the necessary enzymes to propel the animal upwards by using muscles. However, in the developing skeleton, the upper and lower jaws had a tendency to grip the body,