
Turing-NLG: A 17B-parameter language model - XnoiVeX
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
======
corporateslave5
People are vastly underestimating the changes that are about to come from NLP.
The basic ideas of how to get language models working are just about in place.
Transformer networks, and recent innovations like GPT-2, Google's Reformer
model, etc., are precursors to the real machine learning boom. Machine learning
as we have known it has been stuck as an optimization tool, used for computer
vision here and there. NLP, and with it the ability to create, synthesize, and
understand content, will change the internet.

More than that, I think NLP will unlock new ways of interacting with
computers. Computers will be able to handle the ambiguity of human language,
transcending their rigid “only do exactly what you tell them” models of the
world.

Edit:

Adding this to give more technical context. I think most people don't know
where the line currently is between what's possible and what's not, and also
what we are on the cusp of. And we are on the cusp of a lot.

A quick explanation of one area is here:

Basically, transformer models are the best for NLP. They use something called
an attention mechanism, which allows the model to draw correlations between
pieces of text/tokens that are far apart. The issue is that this is an
O(n^2) operation, so the model is bounded by the context window, which is
currently mostly at 512 tokens, and is thus bounded in how much it can
understand. Recent innovations, and further study, will broaden the context
window and thus unlock better reading comprehension and context
understanding. For instance, the ability to answer a question using a piece of
text is mostly stuck at just finding one paragraph. The future will see models
that can find multiple different paragraphs, understand how they relate, pull
out the relevant information, and synthesize it. This sounds like a minor step
forward, but it's important. It will unlock better conversational abilities,
but also better ways to understand how different pieces of textual
information relate. The scattershot of information across the internet can go
away. Computers can better understand context to act on human intention
through language, unlocking the ability to handle ambiguity. This will change
the internet.
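The quadratic cost mentioned above shows up directly in a minimal NumPy sketch
of scaled dot-product attention (an illustration of the general mechanism, not
Turing-NLG's actual implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token attends to every other,
    so the score matrix is n x n -- the O(n^2) cost described above."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # shape (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # shape (n, d)

n, d = 512, 64                                      # 512 = the context window mentioned above
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)                                    # (512, 64)
```

Doubling n quadruples the size of the `scores` matrix, which is why the
context window is the bottleneck.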

Again, to emphasize: these models only started showing up in 2017! The
progress has been rapid.

~~~
joe_the_user
_Computers will be able to handle the ambiguity of human language,
transcending their rigid “only do exactly what you tell them” models of the
world._

So, are there reasonable examples now of these models handling semantic
context? So far, what I have seen is generated text where the lack of
_understanding_ takes three paragraphs to become obvious rather than one.

Human language is this marvelous framework of symbols associating with other
symbols, as well as with well-known and vaguely-guessed facts about the
world.

Human relations are very robust: for example, two people can have a longish
conversation where, at the end, they realize they're talking about two
different people (or different days or events). But in those circumstances,
they can correct and adjust. "Solid" understanding is there, but it's under a
lot of layers of social cues, protocols, and multiple meanings.

~~~
mumblemumble
> So, are there reasonable examples now of these models handling semantic context?

This is about where I am stuck. I'll start believing that we truly are on the
cusp of a revolution as soon as I see Google Translate reliably knowing when
to translate "home" into French as "domicile", "foyer", something along those
lines, or as "accueil."

Right now it seems to very frequently choose "accueil", which is generally
wrong, except when you're talking about websites and software user interfaces.
That it's biased so strongly toward that error speaks volumes about how
critical semantics are to sorting out natural language, and also about how bad
current NLP systems are at dealing with semantics.

~~~
joe_the_user
Syntax and semantics were both developed for human language, yet it's much
easier to tease the two apart in a computer language than in a human one. With
syntax and semantics so wrapped together, however, it kind of seems like you
can go a long way just by capturing syntax, rhythm, word choice, etc. Which is
to say, the semantic side can be even worse than it seems, i.e., nonexistent.

~~~
mattkrause
A long way for what though? The unicorn story is neat, but I don’t have a
giant unmet demand for rambling.

------
saurkt
One of the team members from Project Turing. Happy to answer any questions.

~~~
hatsuseno
How close do you think the technology is to answering -this- question?

~~~
throwawayhhakdl
1) How close do you think the technology is to answering -this- question?

Four days!

2) How long in years?

Three years!

------
rjeli
I have been bearish on AGI, but GPT2 surprised me with the lucidity of its
samples.

My take from the past few years is that we're 99% done with the visual cortex
- convolutional nets can be trained to perform any visual task a human can in
<100ms. Now I'm mostly convinced that GPT2 has solved the language cortex, and
can babble as well as we will ever need it to. We just need a prefrontal
cortex (symbolic processing / RL / whatever your pet theory is) to drive the
components, which is a problem we have not even started to solve. I am 90%
sure it is a different class of problem and we won't knock it out of the park
in 5 years like the visual/language cortexes, but we can hope.

edit: it's possible cognition follows from language, which would be
convenient. is GPT2 smarter than a dog? I don't think so but I could be wrong
¯\\_(ツ)_/¯

~~~
kragen
I have been bearish on AGI, but GPT2 surprised me with the lucidity of both
paths. I still maintain my support for the basic metric of the GPT-I. However,
I have a number of requirements on how my proposal is to be funded to resolve
concerns. First, I strongly believe that academic research should be the
method of choice (that is, if we are to figure out how to make AGI possible),
and I advocate funding to support results from the central bank community.
Second, given that the GPT can be articulated in mathematical terms, this
should be reflected in funding policy. A very serious concern is that if
funding of GPT is disincentivized, investors may react similarly to the way
they reacted to AGI. This is

~~~
bonoboTP
You generated this with GPT, right?

~~~
kragen
Yes.

~~~
1ris
While Markov chains sound like a schizophrenic, this sounds like a banker on
coke. Pretty impressive progress, I guess.

~~~
kragen
That's because of the word “bearish”. It can sound like just about anything.
For example, prompted by your comment, it sounds like a human interest
journalist. (Journalism and fanfic, in particular HPMOR, seem to have
comprised a large part of its training corpus; it composes Harry Potter porn
at the slightest provocation; “Hermione moaned”, say.)

While Markov chains sound like a schizophrenic, this sounds like a spacial
disjointed notebook, as if somebody was trying to write in two places at the
same time. "Today is Monday, it's Saturday night, I forgot to write to my dad
and he only leaves the house for a couple of hours."

Yet, sometimes Markov chains turn out to be the most beautiful art form. His
wife Jennifer Neil, whom he met at a barbecue and has been married to since
1998, attributes this creative process to the constant ups and downs in his
old job.

"His numbers are just insane," says Jennifer Neil. "I don't know where he
keeps them, but they're very mind boggling."

Somewhat disconnected from the actual

------
eyegor
I've always been interested in techniques to try to minimize parameters or
alternate approaches to learning. Meanwhile, state of the art is over here
just finding clever ways to make everything bigger. I have a feeling we're
going to end up with a very different landscape in 5-10 years, much like the
automotive industry never started mass producing inline 12s and instead moved
to turbos and superchargers.

------
freediver
I can understand announcing this without code, but without a demo so anyone
can try it in different scenarios?

~~~
saurkt
If you want access, please send an email to [turing_ AT _microsoft _DOT_ com].
Remove underscores and spaces.

------
0xff00ffee
B = Billion, not Byte. For a second I was like, WTF?

~~~
ngcc_hk
I thought I was the only one. There is no trigger to deep learning.

But the article is fascinating nevertheless. Not sure it's an AlphaGo-level
breakthrough.

~~~
make3
Not at all comparable. It's just a scaled-up GPT-2; no new ideas,
deep-learning-wise.

------
ragebol
All these language generation models, in short, base their next word solely on
the previous words, right? I'd expect that these generators could be
conditioned on e.g. some fact (as in first-order logic, etc.) to express
something I want. This is roughly the inverse of, for example, Natural
Language Understanding.

Does anything like this exist?

~~~
gchq-7703
I'm fairly sure that these models don't work solely on the previous word, but
instead are able to remember some level of information from history.

Otherwise, you'd reach a word like 'and' and couldn't possibly follow it with
a logical statement that follows on from the previous part.
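A minimal sketch of that point (a toy stand-in, not any real model's API):
autoregressive generators condition each next token on the entire prefix, not
just the last word.

```python
def next_token_scores(prefix):
    """Toy stand-in for a trained model: it inspects the WHOLE prefix,
    so an early 'question' token still influences the next word."""
    if "question" in prefix:
        return {"answer": 0.9, "and": 0.1}
    return {"question": 0.5, "and": 0.5}

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        scores = next_token_scores(tokens)          # conditioned on all previous tokens
        tokens.append(max(scores, key=scores.get))  # greedy decoding
    return tokens

print(generate(["a", "question", "about", "NLP"], 2))
# ['a', 'question', 'about', 'NLP', 'answer', 'answer']
```

Conditioning on something beyond word history (a fact, an instruction) is what
systems like CTRL explored with control codes prepended to the prompt.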

~~~
ragebol
This is why I said 'words', multiple :-).

My point being that these generation models should be conditioned on something
more than just word history, like something they want/are instructed to
express.

------
lowdose
This is GPT-2 × 10. For anyone wondering what GPT-2 can do, look at this
baffling subreddit and marvel at how one GPT-2 model trained for $70k spits
out better comedy than everybody on the payroll of Netflix combined.

[https://www.reddit.com/r/SubSimulatorGPT2/](https://www.reddit.com/r/SubSimulatorGPT2/)

~~~
speedgoose
It's definitely better than the original using Markov chains. It fits this
use case very well, and in my opinion only this use case.

GPT-2 is still very random and quite stupid.

You start it with your love for your girlfriend as context, and two
paragraphs later she becomes a cam girl into hardcore anal. You start with
religion, and get "Muslims must be exterminated". You start with software and
you get a description of nonexistent hardware with instructions on how to set
up a VPN in the middle. You start with news, and you can read that China
supports the Islamic State.

That's cool because it has more context than Markov chains, which usually have
only 3 words of context, but there's still a long way to go before I trust
anything generated by this kind of algorithm.
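For contrast, the "3 words of context" the parent mentions is the whole state
of a trigram Markov generator; a minimal sketch (toy corpus, not the actual
SubredditSimulator code):

```python
import random
from collections import defaultdict

def train_markov(tokens, order=3):
    """State is only the last `order` words -- the ~3-word context window
    the comment contrasts with GPT-2's 512+ tokens."""
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        table[tuple(tokens[i:i + order])].append(tokens[i + order])
    return table

def generate(table, state, steps, order=3, seed=0):
    random.seed(seed)
    out = list(state)
    for _ in range(steps):
        followers = table.get(tuple(out[-order:]))
        if not followers:  # dead end: this n-gram never occurred in training
            break
        out.append(random.choice(followers))
    return out

corpus = "the cat sat on the mat and the cat sat on the rug".split()
table = train_markov(corpus)
print(" ".join(generate(table, ("the", "cat", "sat"), 5)))
```

Anything further back than three words is invisible to the model, which is why
Markov output drifts topic so quickly.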

~~~
dilap
This stuff is pretty much indistinguishable from the real thing...

[https://www.reddit.com/r/SubSimulatorGPT2/comments/f1pypf/so...](https://www.reddit.com/r/SubSimulatorGPT2/comments/f1pypf/socialists_why_do_you_think_capitalism_is/)

------
Tenoke
I expect we'll see some very interesting, very big models following it. I
didn't dig too far into the code but the library looks very easy to use and
will open up a lot of doors for people who have a few or a few thousand GPUs.

~~~
ghawkescs
I must be missing it, where did you find a link to the code?

~~~
Tenoke
The code for the distributed training library, not the model -
[https://github.com/microsoft/DeepSpeed/](https://github.com/microsoft/DeepSpeed/)

------
tuxguy
[https://news.ycombinator.com/item?id=22291417](https://news.ycombinator.com/item?id=22291417)

------
FlyingCocoon
At what stage of throwing compute & data at the problem do diminishing
returns set in?

------
bitL
What GPU do I need to train it? Titan Mega RTX with 240GB of RAM?

~~~
slashcom
A DGX-2 will do just fine.

------
danharaj
If your model has 17 billion parameters, you missed some.

------
01100011
How long until the language models stabilize enough that we can bake them into
a low-cost, low-power chip for edge uses?

~~~
foota
I think this is largely unnecessary, can't things like TPUs handle the
inference?

~~~
the8472
Putting all your speech/text onto cloud machines runs counter to e2e encrypted
messaging.

~~~
foota
I think you can get hardware like a TPU for consumer products?

------
galkk
Those summaries look impressive, although a bit repetitive.

~~~
RobertDeNiro
What they don't tell you is that these summaries are always hand-picked from a
few that were generated.

~~~
XnoiVeX
Quite possible, but that also means there is an opportunity to implement
some sort of RL to choose the best possible summary.
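The simpler, non-RL version of that idea is best-of-n reranking: generate
several candidates and keep the one a reward function scores highest. A sketch
with an entirely hypothetical keyword-coverage heuristic standing in for a
learned reward model:

```python
def score(summary, keywords):
    """Hypothetical reward: fraction of key terms the summary covers.
    A real system would use a learned reward model instead."""
    words = set(summary.lower().split())
    return sum(k in words for k in keywords) / len(keywords)

def best_of_n(candidates, keywords):
    # Rerank already-generated candidates and keep the best one.
    return max(candidates, key=lambda s: score(s, keywords))

candidates = [
    "The model has many parameters.",
    "Turing-NLG is a 17B parameter language model from Microsoft.",
    "Language is hard.",
]
keywords = ["turing-nlg", "17b", "language", "model"]
print(best_of_n(candidates, keywords))
```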

------
riku_iki
Unfortunately they abstained from participating in the more popular SQuAD and
GLUE benchmarks...

~~~
octbash
Those are question-answering and language-understanding benchmarks
respectively, neither of which has been suitable for language generation mode
evaluation since GPT-1 was roundly beating by BERT. GPT-2 didn't evaluate on
them either.

