As a fan of the field, however, it is undoubtedly very cool.
I still think there are tasks that GPT-3 has no chance of tackling (say, generating code to solve a novel programming task), but the bitter lesson is a bitter one indeed...
I think the big point, and what scares me a bit, is that we have yet to discover any sort of fundamental conceptual limit to the Transformer architecture.
Many ML experiments, though, are exactly what you said: we ultimately haven't the slightest idea what we are doing, but maybe if we throw enough compute at gradient descent over some generic enough network, it will learn to do something interesting. Well, I'll tell you what: theoretically you need only two wide layers of neurons to compute anything there is to compute; there just isn't enough computational power in the entire world to come up with the "correct" neuron weights entirely by chance. So what the research mostly is really about is coming up with various tricks to make do with less computation, which essentially means making sense of some computational problems and making neural networks "less random", to direct the learning process. It is either about new neural architectures, new ways of posing questions for a NN to answer, ways to make more training data out of the same amount of available external data, or alternatives to the gradient-descent learning approach altogether.
So, to wrap it up: it may be kinda interesting to know that the GPT-2 type of network hasn't reached its full capacity, and that if we scale it up even more, it still learns something new. Inspecting how it behaves also might lead to some insights or new possible applications. But ultimately, if there's no novelty of any kind in an experiment, training an NN nobody can realistically reproduce is a great way to show off (or to achieve some practical goal for the owner of this NN, if it was used to find something practical), but it doesn't really contribute much (or anything) to ML research.
Two wide layers can compute any computable function, but they do not scale in any manageable fashion with problem complexity (because you have to model problems as increasingly large lookup tables, whose size explodes rapidly). The big thing about GPT is that, so far, growth in GPT size has corresponded to a roughly logarithmic improvement in both the quality of response and the complexity of the problem domain that GPT can model. GPT scales in exactly the way that a two-layer network doesn't, and we don't yet have a handle on at what size it stops scaling.
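As a toy illustration of both points (a made-up sketch of my own, not from any paper or benchmark): a single wide hidden layer with fixed random weights plus a fitted linear readout can drive error down on an arbitrary 1-D target, but only by brute-force width.
    # Toy sketch: universal approximation by brute width (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    target = lambda x: np.sin(5 * x) + 0.5 * np.sin(13 * x)  # some "complex" 1-D function

    def train_error(width, n_train=200):
        """Fixed random hidden layer + least-squares linear readout."""
        x = np.linspace(-1, 1, n_train)[:, None]
        hidden = np.tanh(x @ rng.normal(scale=10.0, size=(1, width))
                         + rng.uniform(-1, 1, size=width))
        readout, *_ = np.linalg.lstsq(hidden, target(x[:, 0]), rcond=None)
        return float(np.sqrt(np.mean((hidden @ readout - target(x[:, 0])) ** 2)))

    for width in (10, 100, 1000):
        print(width, train_error(width))
    # Error falls as the layer gets wider, but the width needed blows up with the
    # complexity of the target -- the lookup-table problem described above.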
I don't know physics well enough to give a sensible example, but in physics, as generally in science, a theory is first formed to explain a set of observations, then the theory is tested against new observations and either discarded, if the new observations do not agree with it, or accepted otherwise (where "accepted" doesn't necessarily mean that all debate will cease and everyone will agree the theory is true). So, in short, the point of the LHC is to test the predictions of some theory (I think what it's testing is the Standard Model, but I'm not sure) and its size is a function of how easy or hard it is to test such predictions, not an attempt to "look harder" just in case something comes up. And if "something completely new" does come up, then the cycle starts all over again - with a new theory and new experiments to test its predictions. Just "discovering something completely new", i.e. making a new observation, doesn't really tell us anything until we have an explanation for it, in the form of a scientific theory that can be tested.
In machine learning we do not have any theory to guide our experiments so the done thing is to try things and see what works. So just because LHC and GPT-3 are both "large", doesn't mean they have the same goals- or that the goal of the LHC is just to be large, because large is better.
To clarify "will it work if we scale it?" is not much of a scientific claim. Because it doesn't tell us anything new. We've known that it's possible to improve performance by spending more computing resources for a long time now. If I run bubblesort on a supercomputer for a month and tell people that I have sorted a list of integers larger than anyone has ever done before- will people fall over their chairs in amazement? Of course not. Not in any other field of computer science (except perhaps high performance computing) is scaling resources an acceptable way to claim progress. That it is in machine learning is a result of the fact that we have no guiding theory to drive our experiments. So people just try things to see what works.
Obviously, if someone has more resources than almost everyone else, they can try more things and hope to luck out on more interesting results. That's the reason why it's always Google, OpenAI, Uber, Facebook etc. that are in the news for "new machine learning results". They got more stuff to throw at more walls and more eyes to see what sticks.
The original transformer experiment tested the hypothesis: "if we organize layers of attention mechanisms in a certain way and feed them tons of text they will be able to process that text effectively enough to build a very rich and consistent language model". GPT experiments test the hypothesis "if we feed the transformer more data its language model might become rich and consistent enough to produce human-level results". To me this sounds like a well-defined scientific experiment.
Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k digits of pi after processing n digits of pi - without knowing anything about how to compute pi, or what pi is. To me that deserves falling over my chair in amazement.
>> The original transformer experiment tested the hypothesis: "if we organize layers of attention mechanisms in a certain way and feed them tons of text they will be able to process that text effectively enough to build a very rich and consistent language model".
I read the paper when it came out and I checked it again now to confirm: there's no such hypothesis in there. In fact "Attention is all you need" is a typical example of a post-hoc paper that describes what a research team did that worked and how well it worked. "We tweaked these knobs and out came STUFF!". Typically for deep learning papers it lacks a theory section, the space of which is instead taken by an "Architecture" section which, well, describes the architecture. There are no theorems or proofs. There is nothing that connects some kind of theoretical claim to the experiments. The main claim of the paper is "we built this system and it has better performance than previous systems". Like I say in my earlier comment, that's not an interesting scientific claim.
I'm sorry but personally I find that kind of work irritating. "We tried some stuff and got some results". Woo-hoo. But why did you try that stuff and why did you get those results? Did you try to get the same results without that stuff? Did you try to get some other results with the same stuff? Can you explain what is going on in your system and why it does that when I twist this knob? If I twist that knob, can you tell me what it will do without having to run it first to find out? Typically, 99% of the time, the answer to all this is "no" (i.e. no ablation experiments, no theoretical explanations, etc. etc., no nothing). It's like I say above, just throwing stuff at the wall to see what sticks. And then writing a paper to describe it.
Oh, and calling stuff suggestive names like "attention". If I call it "boredom", am I more or less justified than the authors?
The problem with all those advances is that they all happened more than 20 years ago. My comment discusses the state of machine learning research right now, which is that there are very few new ideas and the majority of the field doesn't have a clear direction.
Note also that the advances you describe were not "guided by theory". They were inspired by ideas about how the mind works. But finding ideas to try is not the scientific process I describe above. And just because you're inspired by an idea doesn't mean your work is in any way a proof or disproof of that idea. For example, CNNs were not created in an effort to demonstrate the accuracy of a certain model of the visual cortex. In fact, Yann LeCun is on record saying that deep learning is nothing like the brain:
Yann LeCun: My least favorite description [of deep learning] is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does.
>> Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k digits of pi after processing n digits of pi - without knowing anything about how to compute pi, or what pi is.
That's not a good example. GPT-3 can't actually do this. In fact, no technology we know of can do this for a k sufficiently large and with accuracy better than chance. Personally I find GPT-3's text generation very underwhelming and not anywhere near the magickal guessing machine your example seems to describe.
Not exactly a paper, but you mean something like this article:
I'm thinking more along the lines of the type of questions you might get in competitive programming.
I'm not aware of anything else showing how effective the transformer architecture is on image tasks, and I think that is a pretty fundamental breakthrough. I realise it's not GPT-3 (it's based on GPT-2), but that seems more like a project resourcing issue than something fundamentally unsatisfying.
That was a smaller model fine-tuned on Github IIRC.
1. Scientifically, synthesizing really simple loop-free programs from high-level specs is not really new. You can synthesize programs at this level of complexity with a few minutes on a single ten year old laptop using 5-10 year old algorithms.
2. Scientifically, it's already known that the architectures used in nlp models are also useful for simple program synthesis tasks.
3. From an engineering perspective, here was a good HN discussion about this a few weeks ago. IMO this comment hits the nail on the head (https://news.ycombinator.com/item?id=23256333):
> This technology converts writing code into bug hunting in pre-written code. Finding bugs in code that you did not write is way harder than writing the code yourself. So if anything, this makes programming harder, not easier, and we will need more programmers, not less.
Rings true for this example. I can write my intended implementation much faster than I can read the generated code, find the bug, and think of a comment that captures the correct spec.
It's interesting, but it's not "generating code to solve a novel programming task".
I can definitely see this tech leading to ever more clever linters though. Having an AI reading my code, interpreting the comments, variable and function names and use that to look for bugs in my implementation could be highly valuable, assuming that the number of false positives isn't overwhelming. "Your comment says that you apply the discount only on palindromes but your condition here is wrong and your code ends up doing the opposite".
I think we will have useful source code synthesizers for high-level languages at some point in the next few years. They will probably look more like alphazero than gpt-3 though.
I'm not sure that they will be commercially successful. The way you automate away programmers is Wix, not Ruby on Rails program synthesizers.
"not novel code" here was referring GP's "novel programming task", not the synthesis method. I think we're probably using different definitions of "task". Where you mean it in a very particular sense (this exact piece of code) and I mean it in a more "writing if-else blocks inside a single procedure with no loops and no recursion using functions that are in-scope" sense.
The proper way to determine if there's anything interesting here would be to run gpt-3 on some existing program synthesis benchmarks. Literally any program synthesizer can look super impressive if you just show one working example in a yt video. My suspicion is that gpt-3 isn't going to do particularly well on those benchmarks at least out of the box, and that getting it to work as well as sota would require a bunch of non-trivial engineering work.
IIUC, the Generalized Program Synthesis Benchmark Suite is still mostly unsolved, including problems like “Given three strings n1, n2, and n3, return true if length(n1) < length(n2) < length(n3), and false otherwise.”
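To make the "simple loop-free synthesis" claim upthread concrete: a brute-force enumerator over a hand-picked DSL can recover a program for this exact spec from a few examples (sketch below, my own illustration, not one of the benchmark tools); the catch, and part of why the benchmark remains hard, is that choosing the DSL and the examples does most of the work.
    # Minimal enumerative synthesis sketch (illustrative only): search a tiny DSL
    # of comparisons over len(n1), len(n2), len(n3) for an expression that matches
    # the given input/output examples.
    from itertools import product

    atoms = ["len(n1)", "len(n2)", "len(n3)"]
    comparisons = [f"{a} < {b}" for a, b in product(atoms, atoms) if a != b]
    candidates = comparisons + [f"({c}) and ({d})" for c, d in product(comparisons, comparisons)]

    examples = [  # ((n1, n2, n3), expected output)
        (("a", "bb", "ccc"), True),
        (("aaa", "bb", "c"), False),
        (("a", "bbb", "cc"), False),
        (("bb", "a", "ccc"), False),
        (("", "b", "cc"), True),
    ]

    def satisfies(expr):
        return all(eval(expr, {"len": len, "n1": n1, "n2": n2, "n3": n3}) == expected
                   for (n1, n2, n3), expected in examples)

    print(next(c for c in candidates if satisfies(c)))
    # -> (len(n1) < len(n2)) and (len(n2) < len(n3))
Real synthesizers prune the search instead of enumerating blindly, but the class of problem is that small.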
My point wasn't that current program synthesis is particularly great, although I do think modern program synthesis tools can probably beat gpt-3 on lots of problems (and allow that the other direction is probably true too...)
My point was that I'm skeptical that GPT-3 would do particularly well on those benchmarks without lots of additional effort. And then, since you can build pretty much anything anyway with enough blood and sweat, the actual question is: would the same amount of effort poured into an alternative approach generate the same/better results but in a way that's far easier to interpret/extend?
It could work. But the yt video alone is more "huh, interesting" than "wow, impressive". If that makes sense.
I got the impression from you saying “You can synthesize programs at this level of complexity with a few minutes on a single ten year old laptop using 5-10 year old algorithms.” that you thought this was generally solved at this level of complexity, rather than merely true for an easier example here and there.
No. Actual geneticists are still pretty skeptical about GWASes because they tell us almost nothing about, well, the genetics behind complex traits. It's all good and well running GWASes for literally anything (see: twitter.com/sbotgwa or that dude who got a pretty good PGS from correlating a country's GDP with the genotypes of Arabidopsis thaliana available in that country) but that's virtually useless for serious research or if you want to know how genes work.
Actually figuring out to what extent a trait is genetically determined usually involves much more complex methods (e.g. mendelian randomization) and knock-out experiments on animal models, which is all terribly expensive and tedious. But that's how actual genetics works, not waving a magic wand of +15% heritability.
Why? You've already said so yourself. Those are expensive and tedious, and searching across the entire pool of human genetics with them is an exercise in futility.
>They're often an early step
I think we're in agreement here, I'm just arguing that "woah, look at all those correlations" isn't a breakthrough or 'genomic revolution' in any sense of the word as far as our understanding of human genetics is concerned.
Edit: if it doesn't work, try this I guess? https://www.google.com/url?q=https://mobile.twitter.com/past...
>You're also wrong that the only thing of value is inferring "how genes work"
Yes it is, that's literally what genetics is about. Otherwise you're back to making a bunch of correlations. If you want a deep understanding of disease, design effective drugs, or even do proper gene editing, you have to understand what genes do. It seems ridiculous to have to say it.
>By the way, what does '15% heritability' refer to?
It is what we in the less-rationalist circles refer to as a 'joke'.
-- Human brains somehow are able to learn from all forms of sensory input and world interaction in ways that pay off in word tests. (Even then, would the information processed by an average human match the GPT-3 training corpus?)
-- The human brain has a very different architecture that lets us learn vastly more efficiently from small amounts of training data.
Is there another possibility I'm missing?
If the latter is true then that gives humans a durable advantage over GPT-3-like approaches.
Possibility: The human brain and GPT-3 are doing radically different things and aren't even comparable. GPT-3 is merely memorizing enough language to pass a turing test, whereas human brains are actually learning how to use language to communicate with other humans.
Evidence: Have GPT-3 write your work emails for a day, and then live with the consequences of that for the next week. It's going to produce text that makes sense, but only in a very particular way. You will end up with email threads where an outsider says "yeah looks like work emails. I believe this is two humans" And, that's very impressive! But your actual interlocutor who understands the full context of the conversation and relationship will legitimately worry for your mental health and maybe even contact your manager.
1. Any time you invent a test, people will find ways to pass the test with flying colors but completely miss the point of the test. The turing test is no different.
2. Being able to imitate a human well enough in a five minute general english conversation is only so useful. There's a reason we don't pay people living wages to write internet comments. This isn't to say that GPT-3 is useless, though. There is certainly demand for very specialized five minute conversers that come at zero marginal cost. I'm worried.
3. We still have no clue how to even begin to approach AGI.
But how far are we really from going beyond the five minute conversation you mention? What if you combine a transformer text model with a chess engine to produce a chess tutor? It doesn't seem like we are too far from a computer program that could teach chess from first principles (i.e. learn chess itself, and teach it in a meaningful way).
Maybe in combination with economic models, it could teach economics?
What else? Perhaps in combination with wikipedia it could teach subjects such as physics or biology? But it would just be regurgitating in that case, not "understanding" what is in wikipedia. We need an existing model to tie to the text generating capability. So what if you take an alpha-go-zero approach to model generation - instead of a human creating the model, it is developed through adversarial learning? Theoretically then, anything that can be formulated in a contest or game could be "learned" and modeled, and then combined with text generation to be taught. That seems pretty powerful, and not so far out of reach.
(Also, I love the idea of letting GPT reply to work emails for a day. Someone should set up some experiments to actually do that, and find out what happens. I bet that even though it would be a disaster, we would learn a ton.)
> Someone should set up some experiments to actually do that, and find out what happens. I bet that even though it would be a disaster, we would learn a ton.
Ha! Agreed! Unfortunately I do too much soft external-facing stuff these days to do this myself :(. Someone who interacts mostly internally with engineers/techy types who might appreciate the experiment should totally do this.
This almost certainly won't work with GPT-3, but it might with GPT-5, GPT-10, or GPT-20.
Current language models learn language from how we use it, but only in relation to more language. Hopefully, we can eventually get some good data on how we use language in relation to the real world too. I can't imagine where you'd get a massive dataset like that though, shy of having robots out in the world acting like babies.
Yes, absolutely humans learn from cross modalities. I haven't seen much work on attempting this in neural networks, but cross modal prediction is known to work well.
> Even then, would the information processed by an average human match the GPT-3 training corpus?
I'd be surprised if it didn't. The visual cortex processes around 8.75 megabits per second. Assuming eyes are open around 16 hours a day that is 63 GB/day of information just from the eyes.
Assuming 500 billion words in the GPT training set, 5 characters per word on average, and 1-byte characters, that is roughly a 2.5 TB training set, or about 40 days of visual information for a human.
Now the two aren't directly comparable, but humans gather a lot of information.
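As a quick back-of-envelope check of those numbers (same rough assumptions as above, nothing more):
    # Back-of-envelope only; every number here is a rough assumption from above.
    visual_bits_per_sec = 8.75e6                   # estimated visual throughput
    waking_seconds_per_day = 16 * 3600
    visual_bytes_per_day = visual_bits_per_sec * waking_seconds_per_day / 8
    print(f"visual input: {visual_bytes_per_day / 1e9:.0f} GB/day")        # ~63 GB/day

    corpus_bytes = 500e9 * 5                       # 500B words x ~5 bytes/word
    print(f"GPT corpus:  {corpus_bytes / 1e12:.1f} TB")                    # ~2.5 TB
    print(f"equivalent:  {corpus_bytes / visual_bytes_per_day:.0f} days of looking")  # ~40 days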
It's not entirely obvious how to do this though. There needs to be a common training representation and that's pretty hard.
In videos, the sequence of frames is important, but each frame is an image. In text the sequence of characters is important.
Maybe something that can accept sequences (of some length) of bytes, where the bytes may be an image or may be a single character, might work. But a unified representation of images and words for training is probably the first step towards that.
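One naive version of that idea, as a purely made-up sketch (the modality markers and layout here are my own invention, not any real model's format):
    # Toy "everything is a byte sequence" encoding; details invented for illustration.
    import numpy as np

    TEXT, IMAGE = 0x01, 0x02                       # hypothetical modality markers

    def encode_text(s):
        return [TEXT] + list(s.encode("utf-8"))

    def encode_image(pixels):
        # flatten an 8-bit grayscale frame row by row into raw bytes
        return [IMAGE] + [int(p) for p in np.asarray(pixels, dtype=np.uint8).flatten()]

    frame = np.random.randint(0, 256, size=(4, 4))  # stand-in for one video frame
    stream = encode_text("the cat sat") + encode_image(frame)
    print(len(stream), stream[:6])
    # A sequence model would train on streams like this, with position in the
    # stream carrying the ordering for both modalities.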
Based on the current state of the art and the research problems that still need to be solved, it's probably 3 years before we are in a position to contemplate the kind of training suggested here.
Brains are also quite energy-efficient, which makes parallel search of architectures a lot cheaper. A small rodent having a handful of offspring every year is equivalent to building new supercomputers with specialized hardware beyond current capabilities on a tiny budget and then using that to train a model and seeing whether it's better. Researchers don't have that luxury so they have to compromise by running a very generic model that is flexible but inefficient.
It would be surprising to me if AGI is achievable via two such different pathways. I don't know this area so I'm willing to be surprised.
Those are not mutually exclusive. Architecture doesn't matter asymptotically but it does matter for task-specific performance for any concrete complexity budget.
Compare to Big-O behavior of algorithms (asymptotes) vs. doing hardware specific optimizations (cache-friendliness, hand-rolled assembly, outsourcing specific parts to ASICs...).
OA is doing the former, looking at asymptotes. Nature has done that too (mammalian brains scale from rodents to humans) but has also applied all kinds of task-specific optimizations. Compare the cerebral cortex and cerebellum, for example: the latter acts like an application accelerator, but if it doesn't work, software fallbacks are possible, albeit slower.
> It would be surprising to me if AGI is achievable via two such different pathways.
Have you considered cephalopod or bird brains?
I think this is right, but it's perhaps misleading to identify this as 'efficiency' in a computational sense - we're talking about trillions of extra parameters derived in the brain from a tiny fraction of the training data.
Input to the human (like audible words said in front of them), and input to the network (whatever data that part of the brain receives from anywhere, both from outside but also from other parts of the brain) are two very distinct things.
> gives humans a durable advantage over GPT-3-like approaches.
I think it is pretty clear to everyone involved that there are, empirically, so far, numerous very important and obvious and deep advantages that humans have over a GPT-3. This doesn't mean that GPT-2/3 aren't a tremendous achievement in AI.
Individual humans only ever observe a tiny fraction of that training data, but our brains are not the products of solely our individual experiences. We come out of the womb with a massive amount of evolutionarily trained, genetically encoded weights in our neural network.
I agree with other people in this thread that structure/architecture is encoded, but not weights.
This is really the key detail and the hole in Gwern's argument that AGI is around the corner. You can't just compare the result. You also have to look at what it took to train the model and at what the model is actually doing.
If you look at GPT-3's output, it only superficially makes sense. Is there evidence of true understanding or is it just a really really good text generator?
Regardless of AGI though, I do think that models like this will eventually mean the end of social media and perhaps all wide open discourse on the Internet. When this stuff gets easy and cheap enough that spammers and propagandists can use it, it's over. How much money in compute/storage would it take to train GPT-3 to advocate for Donald Trump or Joe Biden all day long, or to shill products, or to just generate superficially comprehensible text to execute a kind of denial of service attack on a community?
AGI might happen tomorrow; it might happen in decades, in centuries, or never. GPT-3 is basically a straightforward scaling of GPT-2, but I see no evidence that simply scaling GPT-2 or GPT-3 will lead to AGI. The problem is we don't know what else is needed.
As far as I'm concerned, I would probably pick this over a random fanfic in a Turing Test.
This reads exactly like bad plagiarism, with repetitive phrases padding the word count and too-on-the-nose facts. It's not generating, it's memorizing.
Are you saying that it is plagiarism, or that it just looks like it? To be honest, considering the huge training sets and knowing nothing of the testing methodology, I have a lurking suspicion that it could just be copy-pasting chunks of text...
On the other hand, if it is just writing "articles that feel a lot like plagiarism", then I suppose it's doing its job properly, considering what you find on the internet.
To really evaluate these samples, we need a way to search for phrases in the training data, to see how much is just learned and copied. I've tried Google and DDG, but not found anything.
Edit: I pasted the random sample above into Grammarly's plagiarism detector and it says it "detected plagiarism" but I'm not paying to sign up just to find out what it said (also they make you sign up with your email address before mentioning that it costs money... rude!) Maybe someone with a Grammarly subscription can try it?
Edit 2: Getting off-topic but wow, Grammarly is a nest of dark patterns. Doesn't tell you you need to create an account until you've pasted your text in. Doesn't tell you you need to pay a subscription until you've created an account. Sends you a follow-up email a few minutes later if you don't subscribe. Puts the account deletion link in a different colour and location to the rest of the account settings so it looks like a footer. Swaps the styling for the 'delete account' and 'keep account' links when you do find the delete button to try and steer you away from deleting your account. After that lot I'm never giving them a cent.
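Failing that, if the training corpus were public, the crude version of this check is just n-gram overlap between a sample and the corpus. A minimal sketch (my own, with a toy stand-in corpus; a real check would need a proper index such as a suffix array over the full training data):
    # Sketch of a verbatim-overlap check: what fraction of a sample's n-grams
    # appear word-for-word in a corpus.
    def ngrams(text, n=5):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def copied_fraction(sample, corpus, n=5):
        sample_grams = ngrams(sample, n)
        return len(sample_grams & ngrams(corpus, n)) / max(len(sample_grams), 1)

    corpus = "the quick brown fox jumps over the lazy dog again and again every day"
    sample = "witnesses said the quick brown fox jumps over the lazy dog again and again"
    print(copied_fraction(sample, corpus))   # 0.8 -> mostly copied verbatim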
and compare it to e.g.
GPT-3 has quite a lot of correct facts, manages to mix Fouad Belkacem and Sharia4Belgium into it, etc... I suspect this is what happens when there is not enough training data: you get a lookalike, not something new.
To be clear: it's impressive what GPT-3 is capable of, and it's very probable something better is coming in the near future.
(It's also quite amusing how a lot of these terrorists get first killed by the police and then thrown into jail for 20 years. That'll teach their corpses! )
Well, the average human probably does not excel at this metric either
"Would it be worthwhile, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100x to achieve human-like performance in some domains? Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels. Nevertheless, I think we can expect further scaling." 
For comparison, Nvidia's "Selene" supercomputer only has 2k A100 GPUs and is the #7 fastest supercomputer in the world: https://www.hpcwire.com/2020/06/22/nvidia-nabs-7-spot-on-top...
A100 GPUs are around 3x faster than V100 GPUs.
It would take some time for governments to catch up :)
2. Most of the largest supercomputers are owned by states.
That said, only so many actors could've secretly sourced 10K+ V100s with fast interconnect...
I don't know if you're being hyperbolic here, but if not, I consider that a massive step forward for AI.
This kind of weird characterisation that people keep bringing up is totally wrong. It's more like a savant, who can generate random stories that have no bearing on reality.
It's just a bit better than GPT-2, which spat out mostly incoherent crap.
What's interesting is that the approach still seems to scale, and at some point it might make something that actually generates useful output... and the ability of the model to handle general NLP tasks is a bit better.
So yeah, it's interesting, but no, it's not a massive step forward for AI, in the way having an actual 3rd grader would be.
And additionally he can draw, do image recognition, run circles, climb trees, pick fruits and mine rare minerals. Seems like we already have the business proposition of most AI businesses!
Ouch, we've all just been burned by an AI!
That's actually just a particular special case of how this model is used, where they use the model to predict "the rest" of a story. It is not the nature of the model itself, and it can be used for other things, such as compression or other NLP tasks.
Yes, otherwise a markov chain would be reasoning, too.
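For instance, a toy bigram Markov chain (a throwaway sketch of my own) will happily "complete" arithmetic it has seen before, by lookup alone:
    # Toy bigram Markov chain (illustrative only): it completes text it has seen
    # by table lookup, with nothing anyone would call reasoning.
    import random
    from collections import defaultdict

    tokens = "10 + 10 = 20 . 20 + 20 = 40 . 40 + 40 = 80 .".split()
    table = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        table[a].append(b)

    word, out = "20", ["20"]
    for _ in range(4):
        word = random.choice(table[word])
        out.append(word)
    print(" ".join(out))   # one possible output: "20 + 20 = 40" -- pure lookup, no arithmetic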
> "10 + 10 = 20. 20 + 20 = ___"
You can do the same in Prolog and similar languages. Is the language/the compiler reasoning?
> All of these go far beyond basic syntactic concerns like putting subjects and verbs in the right spots.
I'm not sure about that. The output obviously is a wonderful achievement, and will find a multitude of applications, but isn't it technically still a stochastic model describing the probability of a sequence of events, albeit at unprecedented scale?
Reasoning needs conscious thoughts, sentience, awareness, and we don't have the slightest hints these are present in GPT. Yes, humans reason about something and then write a coherent text, but that doesn't mean that the presence of a coherent text is proof of reasoning - just as the absence of the capability to produce coherent text isn't proof of the absence of reasoning (e.g. in the horrible cases of the locked-in syndrome https://en.wikipedia.org/wiki/Locked-in_syndrome )
Yes, absolutely. The Prolog interpreter is a resolution-based theorem prover.
You will be very hard pressed to find any AI researcher arguing strongly that "that's not reasoning". In fact I believe automated theorem proving is one of the very few cases where you can find so little disagreement about whether something is "reasoning" or not.
Note also that "reasoning" is not necessary to solve a problem like "10 + 10 =
20. 20 + 20 = ___" (1). Knowlege of addition is sufficient: given knowledge of
addition the result of "20 + 20" can be derived without reference to "10 + 10
= 20". And, absent knowledge of addition, even if a system can answer (1), it
will not be able to answer the majority of similar problems, indicating that
it has only memorised the answer to (1).
A better test of what a system has learned is a question like "10 + 10 = 30.
20 + 20 = ___" (2). The answer to that should still be "40", but again that's
not because of any reasoning; it's because "20 + 20" is always "40", even when
preceded by a false statement. So this kind of question is really not any way
to test reasoning abilities.
Edit: Actually, "Alice was friends with Bob. Alice went to visit her friend
___" (3) is not a very good test for reasoning, either. If I were to answer
"Alice", would you be able to say whether that's true or false? The only way
to make such a decision is in the context of a "closed world assumption"
(incidentally, central to Prolog's theorem proving). However, now you're
making a much more precise claim, that "GPT-3 has learned to answer questions
by making a closed-world assumption". You can test this claim much more
convincingly by asking questions like "Alice is friends with Bob. Is alice
friends with Alice"? The answer should be "no" (or "false", "incorrect", etc).
Has this kind of more formal test been carried out, with GPT-x?
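For what it's worth, the closed-world assumption itself is trivial to write down (a toy sketch of my own, purely to illustrate what the test above is testing): anything not derivable from the stated facts counts as false.
    # Toy closed-world assumption: unstated facts are assumed false.
    facts = {("friends", "alice", "bob")}

    def holds(pred, a, b):
        # treat "friends" as symmetric, just for this example
        return (pred, a, b) in facts or (pred, b, a) in facts

    print(holds("friends", "alice", "bob"))     # True  (stated)
    print(holds("friends", "alice", "alice"))   # False (not stated, so assumed false)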
Yes, of course. Reason isn't magic.
Many linguists think that every human language follows fundamental patterns (e.g. ). In that context, the achievement of GPT is that it indirectly derived such a model by working through ungodly amounts of data. The results sound meaningful for us - but that doesn't imply that GPT intended meaning.
Every theory of reason I know has consciousness as a hard requirement. I'm not trying to be pedantic, but the topic of this thread is exactly the kind where clear definitions of words are important.
If Prolog is reasoning, then a scientific calculator is, too. But now we just need another word for the thing that differentiates us from calculators.
If I think about a process using a certain mechanism, and the AI thinks about a process using a similar mechanism, but also I have a consciousness attached on top and the AI does not, then it seems petty to assign these processes different labels based on a component whose mechanical relevance is not shown. I'm not doubting the impact of conscious, reflective reasoning on human capability, mind! But most of the thinking I do is not that.
Also as a general rule, you should be skeptical of considerations of reason that are based largely on introspection; the process is inherently biased towards consciousness as a load-bearing element, since consciousness is so heavily involved in the examination.
Remember that logic had to be invented by Aristotle. It was a mechanical system meant to approximate how humans make decisions naturally.
In my opinion, the AI performs well above a third-grade level in some areas (being able to write a complex paragraph, terminal completion, writing code) and below it in others (not knowing basic multiplication, not understanding spatial context).
If we can make an AI with an intelligence of a human, we can train that to be very good at some problem (like a human expert in some area). And then we can clone those intelligences into thousands in order to progress in that area much quicker.
A human-level intelligence is also a very important goal because it, even by definition, produces the singularity point. (AIs that can make better AIs, exponentially.)
> We don't care if it can't make a decent cup of tea.
We absolutely do, although it depends on the price at which they can do that. But once an AI can do that at all, it is only a matter of time before we can make them do it cheaply (we are very good at making things cheap once we know what things we need), at which point it becomes a driving force in our progress. Once there are self-driving cars, there will be a revolution in the labor force, for example, and driving is only a small portion of what a "human-level" intelligence is capable of.
But, on the way to AGI - which seems to me a long way off, despite some of the interesting arguments made here - it's not really true that "being very humanlike" is a big commercial advantage. As long as we have (say) less than the intelligence of a 2 year old child, or a chimp, then we'd rather have computers be doing very un-human things.
You could of course find cases where this term makes other sense (or does not at all), since English is a flexible language, but I think that in areas where we obviously discuss AI/ML, let's just use the de-facto term and make everyone's lives easier.
A common quip, but deeply inaccurate. Making productive agents who will advance your community is not cheap (neither in money nor in effort), and while some parts are fun, many parts are extremely not-fun. Productivity is clearly part of the goal of AI.
If the best AI you get is one that thinks like a 4-month-old fetus, making those is cheap and fun.
If that's your perspective, that would be considerably staggering progress.
There is an implicit goalpost for AI that laypersons have that assumes if it can't kill us all then it's not "real AI."
Who are these experts? Where are records of these routine statements? Seriously, I am an AI researcher, who said this?
"Deep learning (does not) scale... We can't just scale up AlexNet and get respectively better results - we have to fiddle with specific architectures, and effectively additional compute does not buy much without order of magnitude more data samples, which are in practice only available in simulated game environments."
But Gary Marcus is the gold standard. Check out his 2018 take-down attempt, which is clearly mostly wrong:
GPT-2 was critiqued for its inability to deal with simple arithmetic questions, as well as for contradicting itself over long periods.
GPT-3 has similar deficiencies in terms of sensible real-world knowledge and displaying coherence, both with its own previous answers and with more general real-world knowledge:
'Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”.' Deficiencies of this type would obviously torpedo attempts at open-domain dialogue over long periods.
Take a typical paper pointing out the failings of brute-force methods for dialogue. They trot out a classic example:
'Without getting into the specifics of how these systems are constructed, it is apparent that such an approach [brute force] provides a method for imitating a dialogue but not for participating in one. We can see the limitations of the approach from examples like these:
Human: what is your job ? Machine: i ’m a lawyer .
Human: what do you do ? Machine: i ’m a doctor
(Vinyals & Le, 2015).' 
Huge models can provide plausible, but not consistently coherent, dialogue responses. In the dialogue domain, the 'winter' will come (/has come) when it becomes clear that Meena, BlenderBot etc. need a little help when it comes to coherence over an arbitrary number of turns, displaying 'common sense physics', and so on.
Spectrum: Hype is bad, sure, but why do you say it’s “dangerous”?
LeCun: It sets expectations for funding agencies, the public, potential customers, start-ups and investors, such that they believe that we are on the cusp of building systems that are as powerful as the brain, when in fact we are very far from that. This could easily lead to another “winter cycle.”
To clarify, that's not to the credit of the article. The author is basically taking the piss out of people like LeCun, sarcastically describing them as "eminent, respectable, serious people" who "spoke in considered tones" (as if that's a bad thing) and wondering why they haven't issued a "mea culpa". At the least, I find it a bit conceited to expect the people who built up a field of research from nothing to apologise for being worried that the field might be endangered by overhyping from a flood of newcomers who don't understand it.
I'm not sure this comparison holds... it seems like a chain of fuzzy implications taken as necessary fact.
I think we get to general AI by using prior knowledge, exploiting deep learning, and also building systems that can develop their own models of the world, which they can maintain, modify, discard, and combine.
As always, a great write up by gwern!
If we can build something capable of passing Winograd schemas, then it can probably write working non-trivial computer programs from plain text.
Google's PEGASUS summarization model has learned to count up to five (which is amazing!!). That's "only" 568M parameters. It'd be interesting to see GPT-3 fine-tuned against the PEGASUS objective function.
>Following this post is an example article from the XSum dataset along with the model-generated abstractive summary. The model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.
>As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.
It surprises me a lot more than the excellent performance of GPT-3 on text generation for example. GPT-3 is amazing but looking at GPT-1 -> GPT-2 -> GPT-3 it isn't surprising. Counting on the other hand is something I wouldn't have expected from a summarizer.
But to me that isn't as surprising. Not claiming I would have thought of it, but if you have a very large multi-dimensional space (such as GPT-3) then giving it some examples of something pushes it into that general area of the space.
Generalizing concepts isn't a new thing - one could argue that word2vec from 2014 did that pretty well. GPT-3's "concepts" are vastly more complex than the single word (or maybe 2 word) concepts in Word2Vec though.
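For a concrete version of that word2vec point, the classic analogy-by-vector-arithmetic demo (this assumes gensim and its downloadable GloVe vectors are available; it's my own illustration, not something from the thread):
    # Classic embedding arithmetic as an example of single-word "concepts".
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")   # small pretrained word vectors
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' typically ranks near the top; each concept here is one 50-d vector,
    # vastly simpler than whatever GPT-3 represents internally.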
I'd love to see an architecture that can keep a separate short-term memory to allow it to count with multiple digits and follow algorithms. On the other hand, given what we've seen from GPT, at that point I would actually worry about it becoming a general intelligence...
But how would that work?
I agree it probably doesn't "understand" math, but it has learned that number words can substitute for each other in a sentence (three ships/four ships/five ships) which isn't surprising.
But it has somehow learned to link that word with the correct length of the sequence of names, which is astonishing. I can't think of obvious "cheats" that make this work.
The best I can think of is that it has learned to count commas when they are separated by words.
To me it just seems like what supercomputing is to normal computing: It makes the computationally expensive stuff do-able in a reasonable amount of time, or gives diminishing returns on existing algorithms. But it doesn't magic in any real advancements.
The problem in AI/ML and the concept of "AI winter" to me was always the barrier of the fact that we're just doing predictions, with no deep meaning or comprehension of features. The layman thinks there's magic, but there's not, so when that truth is revealed there will be problems. There's nothing intelligent about our Artificial Intelligence; we're just doing statistics with big data and added sprinkles. OpenAI just proved they could do statistics with even bigger data and more expensive sprinkles.
Has their work really shown we can get past that core problem? Personally, I don't see it.
I mean, you could pretty much say that's how the human brain works, couldn't you?
As is often the case, the truth is somewhere in the middle. It’s almost certain that we won’t reach AGI without a fundamental breakthrough on soft/wet-ware but it’s also nearly certain that even with the best algorithms we will need to efficiently harness and coordinate massive compute power, as we’re learning to do now.
You might argue back: well, the human brain has pre-trained neural networks with billions of hours of training time. Well, that isn't really the case. We don't start off with some pre-existing memory of what "love" means, or what "Physics" is, or trillions of bytes of data. All we have is a capacity to learn which is highly efficient, a conscious mind which is aware of itself, and certain fundamental drives coming from our bodies and instincts. If you have a human child and give it zero input information it will never learn a language or be capable at all in any sense of the term. So we become incredibly capable based on a tiny fraction of input data fed into us after birth.
The way the human brain and mind works is deeply tied in to the experience of having a body, knowing we are mortal, and having fundamental drives such as a drive to survive, eat, drink, keep ourselves safe, and also a drive to be social, find a mate, and procreate. I would argue that we will never be able to have a computer/algorithm that thinks like we do unless it also has drives like we do and a body like we do, since so much of our process of thinking is tied in to having a body, our awareness of mortality, and our basic human drives and experience.
A = C+D
B = E+F
Love = X = A+B = C+D+E+F
Obviously the above is contrived and abstracted, but you get my point. If I took a little bit of time, I can schematically map every word to makes up the definition of love, and how they interact. Then I can associate 3D, real world graphical observations to each of those words and then love as a concept holistically (as humans do, we're not just confined to text data, we observe extremely rich 3D visual data, and audio data, and touch data, etc...). There's no reason to believe a massive "correlation machine" can't do the same thing with the right algorithms, enough compute power, and multimodal inputs. Furthermore, we can make the correlation machine even better by specializing parts of the hardware for certain tasks, just like the brain.
I know of the general idea of consciousness, but I can’t boil it down to first principles. Self-awareness, on the other hand, is more tangible. AI would seem capable of internal cognition, reflection on past experiences, etc...They might not have the desire or need for such reflection, but they would certainly have the ability.
So sure, when we learn how to walk or see or read, we may be in this mode of simply discerning patterns in large amounts of data. But when we learn maths or programming, we are using a completely different kind of learning.
For instance, there is no continuity of small changes in height that will bring you from climbing up incrementally larger trees to climbing to the moon.
The question is whether gpt-3 vs human intelligence is more like climbing a tree vs climbing a mountain or more like climbing a tree vs building a rocket.
> the fact that we're just doing predictions, with no deep meaning or comprehension of features
What - specifically - do you mean by this?
Edit: Nevermind, my browser seems to just be screwing up on Gwern's footnotes in general at the moment.
I had never heard of MuZero before. It's impressive that it can reach AlphaZero levels in Go without knowing the rules.
I think it will be able to (imperfectly) do things like the following:
- OCR from images
- Textual descriptions of images
It may start to make some progress towards things like:
- Generating images from a textual description
- producing structured documents (eg HTML) from document images
It'd be interesting to see how far along they already are with this.
What is the minimum hardware required to run this locally?
Or what's the cheapest way to run this model (even at barely acceptable performance) at a cloud provider?
When is someone going to advance pooling techniques? We desperately need improvement!
Scaling a model is just like it sounds: more data fed into a bigger network with more parameters. The gist of what this article is saying about scaling is that there's no sign of diminishing returns yet in terms of what the network can do and how well it generalises as the number of parameters is increased: the "more parameters = better performance" trend continues up to the enormous size of the full GPT-3 model, with no indication that even bigger models won't have even better performance.
Here is the GPT-3 paper: https://arxiv.org/pdf/2005.14165.pdf
If you really want to understand, skim this, and focus especially on the graphs, as they show the scaling. The x axis is usually model size, and the y axis is mostly accuracy or "loss" (~error).
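Schematically, the trend in those graphs looks like a power law in parameter count; the constants below are invented purely to show the shape (not fitted to the paper):
    # Schematic scaling curve with made-up constants -- shape only, not a fit.
    def toy_loss(n_params, scale=2e4, alpha=0.08, floor=1.7):
        return (scale / n_params) ** alpha + floor

    for n in (1.5e8, 1.5e9, 1.3e10, 1.75e11):   # roughly GPT-2-small .. GPT-3 sizes
        print(f"{n:.2e} params -> loss {toy_loss(n):.2f}")
    # Each ~10x in parameters knocks a similar chunk off the loss, with no
    # plateau visible over the range that has actually been tried.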
It doesn't mention exploding the earth, and while there is a little ambiguity, as Gwern does imply, they are describing a recent paper that concludes that large fission reactions are simply impossible: "They found instead, that instead of building up to a grand climax, it runs down and stops like an unwound clock."
The final line of the caption is 'Readers made insomnious by "newspaper talk" of terrific atomic weapons held in reserve by dictators may now get sleep'. At least superficially, that sure sounds to be more about atomic weapons being impossible than about whether the chain reaction would consume the entire earth.
I think you are confusing the actual linked article with Edward Teller's later argument that a nuclear fission explosion might ignite the atmosphere: https://www.realclearscience.com/blog/2019/09/12/the_fear_th....