
GPT-2: 6-Month Follow-Up - xcodevn
https://openai.com/blog/gpt-2-6-month-follow-up/
======
gambler
_" Cornell University is studying human susceptibility to digital
disinformation generated by language models."_

 _" The Middlebury Institute of International Studies Center on Terrorism,
Extremism, and Counterterrorism (CTEC) is exploring how GPT-2 could be misused
by terrorists and extremists online."_

 _" The University of Oregon is developing a series of “bias probes” to
analyze bias within GPT-2."_

But apparently no university is studying the social and economic impact of
using terabytes of public data to train algorithms that, for all practical
purposes, end up being inaccessible to the average person.

If things go on the way they're going right now, in 20 years millions of
people will be "mechanical turked". Most information-processing tools will
be mediated exclusively through companies like Google and Amazon. They will be
less like normal tools (e.g. word processors) and more like systems you have
to be a part of. Can you imagine the levels of inequality involved? The hyper-
centralization of power? _This_ is the foremost challenge presented by AI, not
some hypothetical nonsense involving terrorists using a text generator.

And it's not like there aren't any solutions. Douglas Engelbart, for example,
pointed out a great way of introducing technology into society without
screwing most of society over:

[http://dougengelbart.org/content/view/138](http://dougengelbart.org/content/view/138)

We kind of followed his vision for a while, with good results, but AI seems to
be going in an entirely different direction.

~~~
newhaus1994
We (the Middlebury Institute's CTEC) are an extremism and terrorism research
lab, and so we're tracking the ways that tech is used by terrorists and
extremists.

For a lot of nonstate orgs with sophisticated propaganda arms, an
ideologically cohesive text generation capability would be a huge advantage in
scaling up info ops. We are looking to measure whether or not GPT-2 or other
neural text generators are useful for this, or if that risk is, as you say,
nonsense.

~~~
40four
I think their point isn't that terrorists leveraging this tech isn't a
problem. It certainly is. But the greater problem is a few large entities
being the only ones who have access to or control over it.

I think it's pretty clear that terrorists or any other bad actors will find
great value & utility in this tech. The article from OpenAI says "Humans can
be convinced by synthetic text," and research at Cornell found that people
find it almost as convincing as New York Times articles. I would be interested
in learning about the methods you guys are using to determine this. I wonder
how it could be measured?

So let's assume the answer is "YES! This technology is dangerous." The
Middlebury program, Cornell, and more and more universities and research
groups find the same thing. Then what will the recommendations be? Certainly
not to release it into the wild. I think they will be to keep it locked up: to
keep it in the hands of a few large and powerful companies, with the resources
to 'manage' such a thing.

This seems to be what the original comment is trying to illustrate, and I
think it's interesting to consider the long-term implications. The tech exists
now. There is no going back. So is it worse to let it out of the box, or to
let but a few have control over it?

~~~
newhaus1994
In spite of all that we're studying wrt abuse potential, I (and my team)
generally support open-sourcing tech, and I hope that we can contribute not to
"oh this is dangerous, don't release" but rather to "oh this is dangerous,
it's already released, what are we going to do now?"

~~~
40four
Great, keep up the good work! Are you able to discuss how studies like yours
work? Is it along the lines of determining whether people can distinguish
between human-written and AI-generated text? Sounds like a difficult question
to answer.

I suspect they will release the full model in time. It's already trending in
that direction.

------
revel
The OpenAI approach to managing the release of the larger models strikes me
as totally flawed and upside down. The biggest concern the team seems to have
is that the fully trained GPT-2 model will be used to spread propaganda and
misinformation. They also imply that the biggest hurdle to training a similar
model is the money needed to pay for the training resources.

The problem with this approach is that the actors most likely to use GPT-2
maliciously are states. China, for example, _already_ spends millions on an
immense propaganda factory. Money is not a serious obstacle for a state. Given
that other research entities are, by the sound of things, already far along in
developing similar models, it seems unlikely that China and the US don't
already have functional models internally.

On the other hand, legitimate business and research are clearly hamstrung by
withholding the full model. What we have is the maximum degree of
inconvenience and the minimum degree of security. It feels almost perfectly
analogous to the ban on liquids in airports. The motivation for that ban was
that existing security measures couldn't detect liquids, but simply announcing
that a ban would be enforced didn't change the fact that liquids were
undetectable. Instead, millions of travelers were pointlessly inconvenienced
at great cost.

Release the kraken already!

~~~
jcims
I’d recommend re-reading the original GPT-2 announcement, particularly this
section regarding their release policy:

 _This decision, as well as our discussion of it, is an experiment: while we
are not sure that it is the right decision today, we believe that the AI
community will eventually need to tackle the issue of publication norms in a
thoughtful way in certain research areas._

This release approach is an experiment intended to force the conversation
around a release strategy before we actually and unambiguously need one.

------
minimaxir
For finetuning GPT-2 on custom text, my gpt-2-simple package
([https://github.com/minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple))
gets close to going OOM when finetuning the 345M model, even on a 16GB VRAM
server GPU. _Doubling_ the size of the model with the 774M model might cause
it to not work at all, so I’ll need to test.

Of course, the default output from the model might be sufficient, although
it’ll take twice as long to generate text compared to the 345M, which is slow
even on a GPU.

How exactly the large GPT-2 models are deployed is a mystery; I really wish
more of that were open-sourced.

~~~
gwern
I've already tried training with nshepperd's codebase. Sampling works, but
even with memory checkpointing, freezing the embeddings, and using SGD rather
than Adam, it OOMs on a 1080ti's 11GB. Either additional tricks or CPU
training are going to be required.
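
Concretely, the freezing/SGD part looks something like this against a TF1
graph (a rough sketch with tiny dummy stand-ins, not nshepperd's actual code;
wte/wpe are the embedding variable names in OpenAI's model.py):

    import tensorflow as tf

    # Dummy stand-ins for the real GPT-2 graph (wte = token embeddings,
    # wpe = position embeddings, h0/w = one transformer weight).
    wte = tf.get_variable('model/wte', [50257, 16])
    wpe = tf.get_variable('model/wpe', [1024, 16])
    h0w = tf.get_variable('model/h0/w', [16, 16])
    loss = tf.reduce_sum(tf.matmul(wte, h0w))  # placeholder loss

    # Freeze the large embedding matrices: train everything except wte/wpe.
    train_vars = [v for v in tf.trainable_variables()
                  if '/wte' not in v.name and '/wpe' not in v.name]

    # Plain SGD keeps no optimizer state; Adam stores two extra float32
    # buffers per parameter, several GB more at GPT-2 scale.
    opt = tf.train.GradientDescentOptimizer(learning_rate=1e-4)
    train_op = opt.minimize(loss, var_list=train_vars)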

~~~
newhaus1994
I'm the lead researcher on the Middlebury Institute project looking at fine-
tuning the bigger models, and I originally got OOM on 774M and 1.5B. I had to
get an Azure instance with 24GB VRAM to handle it (using nshepperd's
codebase). It works, but takes a while (~500 epochs takes 12 hours on a
100k-word training dataset).

~~~
gwern
Ouch! So 11GB is nowhere close to being enough, then. I wonder if even
switching to FP16 will be adequate?

~~~
newhaus1994
We might be able to get 774M down to work on a single GPU. I'm definitely not
using all 24GB, so fp16 might be able to get it down enough.
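
Back-of-the-envelope numbers (a rough sketch counting only weights, gradients,
and optimizer state; activations come on top):

    # Rough VRAM budget for fine-tuning the 774M model (sketch only).
    params, GB = 774e6, 2**30

    fp32_weights = params * 4 / GB           # ~2.9 GB
    fp32_grads   = params * 4 / GB           # ~2.9 GB
    adam_state   = params * 4 * 2 / GB       # m and v buffers: ~5.8 GB

    print(fp32_weights + fp32_grads + adam_state)  # ~11.5 GB: no room on 11GB
    print(params * 2 * 2 / GB)                     # fp16 weights+grads: ~2.9 GB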

~~~
sdan
How would you use fp16 to get it to work on a single GPU? And if you did, what
GPU should you use?

------
The_rationalist
<rant> Is there any real use case for GPT-2? Does it solve any problem? I've
read almost all the state-of-the-art leaderboards for all NLP tasks on
paperswithcode.com, and the truth is that, except for text generation, OpenAI
holds not one state of the art; they are not even visible in the leaderboards.
OpenAI is maybe the AI research center with the biggest funding, and compared
to other well-known ones (Microsoft, Facebook, Google, or even Zalando...)
they are the one with the fewest results.

From my observations, most SOTAs come from Chinese researchers by far,
followed by DeepMind.

BTW, isn't it a sad truth that not one of the major AI actors has a draft of
an AGI architecture, something comparable to Cyc or OpenCog?
[https://wiki.opencog.org/w/CogPrime_Overview](https://wiki.opencog.org/w/CogPrime_Overview)

Two other observations I would like to share: Many important NLP tasks have
almost nobody publicly working on them, it seems; on paperswithcode.com or
NLP-progress (on GitHub) some tasks have only one or two papers... and many
others have not evolved since 2016. Most of the time it seems trivial to beat
the old state of the art: just use BERT or XLNet on a task where nobody has
applied it before and presto, free state of the art for you! Yet researchers
don't seem to chase those low-hanging, high-return fruits. Researchers also
seem to work a lot in isolation: many new generic improvements, like new
optimizers (RAdam, for example) and new activation functions (Swish), make it
possible to beat most of the older state of the art on almost all tasks just
by using them. Yet researchers take years to adopt them because of an absurd
inertia. Also, unlike typical open-source programs, BERT and XLNet have very
low response and activity on GitHub despite major open issues... </rant>

~~~
p1esk
_Many important NLP tasks have almost nobody publicly working on them_

Well, then perhaps you should go work on them, instead of ranting here.

~~~
The_rationalist
Why the ad hominem? I am pointing out a problem with the allocation of
resources in the AI research field. It's not up to me to fix that, but yes, I
am actively working on a logical-fallacy detector, which is the first in human
history and works for the 256 possible forms of syllogisms; I'm expanding it
to other logical forms such as modus ponens/tollens.

~~~
p1esk
_It's not up to me to fix that_

There's nothing to fix. People work on what they want to work on. Things that
seem important to you are not important to me, and vice versa. I'm OK with
that.

~~~
The_rationalist
"People work on what they want to work" ideally yes, but ultimately they work
on something that please them AND that give them a decent salary. Funding
should not go to fun (but useless in the real world) Nlp tasks. "Things that
seem important to you are not important to me, and the opposite." and here's
go relativism or the abandon of thought... It's indeed difficult to quantify
cardinally the utility of an NLP task against an other, but we can agree on an
ordinality (order of magnitude) E.g do you understand that POS tagging or
dependency/constictuency parsing are angular tasks needed by much of the
others. Thus making them the most important NLP tasks as they enable other Nlp
tasks and are the most used in practice? You think that what exactly is more
important? Are you talking about text generation? Why is that important?
Something important enable to solve important problems in the real world. How
text generation solve any real world problem is beyond my knowledge. But if
you rationally think that it's more important that angular Nlp tasks, you can
probably explain why and give an example or two? Yes, an AGI will need to emit
text just as humans do, indeed. But before that she needs to understund the
natural language before emitting it. GPT-2 maybe capture an aesthetic of the
initial input pretty well but it does not generate meaningful sentences or
only by accident, so no GPT-2 does not advance the quest to create an
intelligent agent mastering natural language.

~~~
p1esk
_do you understand that POS tagging and dependency/constituency parsing are
cornerstone tasks needed by many of the others?_

I'm not sure. I rarely have to do that explicitly in my head. Perhaps a model
should learn to infer/guess them implicitly, from context, just like I do.

 _what exactly do you think is more important?_

In my opinion, having a world model (for common sense) and situational
awareness (e.g. through sensor fusion, or from prior conversational history,
or using some externally supplied conditioning) would be far more important.

 _GPT-2 does not generate meaningful sentences, or only by accident_

You think adding POS tags would help it generate meaningful sentences?

~~~
The_rationalist
_I'm not sure. I rarely have to do that explicitly in my head._

Well, I can't prove it, but I strongly believe that our brains use parts of
speech too, unconsciously.

 _Perhaps a model should learn to infer/guess them implicitly, from context._

That's exactly what deep learning POS taggers do; they are far better than
hard-coded algorithms. The SOTA is 97.96% accuracy.
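
For example, a pretrained neural tagger like spaCy's takes a couple of lines
to use:

    import spacy

    nlp = spacy.load("en_core_web_sm")       # small pretrained English model
    doc = nlp("The cat sat on the mat.")
    print([(t.text, t.pos_) for t in doc])
    # [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ...]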

 _In my opinion, having a world model (for common sense) and situational
awareness (e.g. through sensor fusion, or from prior conversational history,
or using some externally supplied conditioning) would be far more important._

Haha, you basically want a general intelligence (AGI). I want it too! And not
enough people work on "architecting" such a thing. OpenCog may interest you a
lot, then. But the reality is that many other "simpler" tasks are needed to
make this happen.

 _having a world model (for common sense)_ is an NLP task. There are some
interesting results: [https://github.com/sebastianruder/NLP-
progress/blob/master/e...](https://github.com/sebastianruder/NLP-
progress/blob/master/english/common_sense.md) Sadly, OpenAI does not work on
this task, at least for now.

 _You think adding POS tags would help it generate meaningful sentences?_ It
would clearly be insufficient yet necessary. I believe they already use a POS
tagger and a dependency parser internally.

~~~
p1esk
_they already use a POS tagger and a dependency parser internally._

Interesting. Where did you see that?

~~~
The_rationalist
Well, it was just a belief. I may be wrong. I asked them out of curiosity:
[https://github.com/openai/gpt-2/issues/168](https://github.com/openai/gpt-2/issues/168)
So we will know.

~~~
p1esk
How do you think it could be used there? A separate model just for providing
tags, or the same model but trained to predict tags as well?

~~~
The_rationalist
I was imagining using a separate model just for providing tags, as they are
very accurate. It would theoretically give GPT-2 useful data.

To my knowledge, GPT-2 has not (yet) been trained to predict POS tags, nor
have BERT, ERNIE 2.0, or XLNet, but I think they have great potential to
improve POS accuracy.

------
gambler
Hopefully someone will make a working demo of it, like Adam King did for 345M.
People should be able to experiment with this stuff without relying on the
hype of press releases:

[https://medium.com/@VictorBanev/interrogating-
gpt-2-345m-aaf...](https://medium.com/@VictorBanev/interrogating-
gpt-2-345m-aaff8dcc516d)

Not sure why OpenAI doesn't do this themselves. It fully aligns with their
stated mission.

~~~
minimaxir
It appears TalkToTransformer has been updated for 774M:
[https://twitter.com/AdamDanielKing/status/116387950071694131...](https://twitter.com/AdamDanielKing/status/1163879500716941314)

------
zitterbewegung
I took all of Donald Trump's tweets and used GPT-2 to make a program that
mimics them.

I found that it might be very effective. I have the test at

[https://docs.google.com/forms/d/1p7tlobl5y5plBCu_enK4KawR7B8...](https://docs.google.com/forms/d/1p7tlobl5y5plBCu_enK4KawR7B8_4Yyb-
wCUh6vr9A0/edit)

I got the information from trumptwitterarchive.com

I also explored creating a system that could recognize fake tweets from real
ones, and I believe I got 94% accuracy. It was a Bayes classifier, but I think
I have to double-check my work.
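
The classifier is nothing fancy; the shape of it is roughly this (a sketch,
not my exact code; the two tiny lists are placeholders for the archive tweets
and the GPT-2 output):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    real_tweets = ["example real tweet one", "example real tweet two"]
    fake_tweets = ["example generated tweet one", "example generated tweet two"]

    texts = real_tweets + fake_tweets
    labels = [0] * len(real_tweets) + [1] * len(fake_tweets)
    X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2)

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))   # accuracy on held-out tweets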

~~~
The_rationalist
"I also explored creating a system that could recognize fake tweets from real
ones and I believe I got 94% accuracy. It was a Bayes classifier but I think I
have to double check my work." Is it open source? This interest me a lot!

------
lucidrains
Hmm, no mention of Megatron in their timeline? [https://nv-
adlr.github.io/MegatronLM](https://nv-adlr.github.io/MegatronLM)

~~~
ivalm
They do mention the 8B+ "GPT-2" model trained by Nvidia, which is their
reference to Megatron.

~~~
lucidrains
Oh! You are right! How did I miss that...

------
rovyko
>As part of our staged release strategy, our current plan is to release the
1558M parameter model in a few months, but it’s plausible that findings from a
partner, or malicious usage of our 774M model, could change this.

This seems naive, but I think it's misdirection. Of course the model will
have malicious users. Propaganda teams started testing its integration as soon
as it was released. It's likely that OpenAI is counting on this for insights
into HOW the model can be used maliciously. It's also possible that the
model's output has inherent trackable markers and OpenAI can later say that X%
of social media posts were made using this model.

So what are the positive applications, aside from prettifying data like sports
and weather reports?

Even with Skyrim's 800+ books, you frequently ran into the same book. Imagine
libraries filled with plausible text that hides nuggets of lore seeded by
developers. Along with more realistic text-to-speech, this could allow games
to support a large diversity of NPCs that have true radiant dialogue and
sound more realistic than "I saw a mudcrab the other day".

With some modifications, I think the benefits of models like this can
outweigh even their nefarious applications:

Defense against text decomposition analysis. The model could be used to
obfuscate writing patterns that might reveal a person's identity, either by
randomizing form or standardizing it. Take your post and run it through the
formatter to get the same idea and intent, but in a style that can't be traced
to your other writing. Or you recast it in the style of Ernest Hemingway, like
thousands of others.

Realtime plausible-deniability encryption. Messages in a monitored chat could
look like mundane conversation but contain encrypted messages. This would
require the model to accept seeds and work partially in reverse, diffing two
sets of text to reveal the hidden message.

In its current form it doesn't look like it can do any of those things, but
the potential is there.

------
baalimago
Even if GPT-2 were released, very few people would have the hardware to run
it, because GPU RAM runs out (and some sort of load-unload system would make
training times unfeasibly long). And those who do have the hardware have
probably already made a version of their own, or have reasons not to. So I'm
wondering if this GPT-2 hype is a genuine concern of OpenAI's, or if it's
mostly a PR flex to say 'Look at us, we made a good model!'

As an example, look at this post by Nvidia
[https://devblogs.nvidia.com/training-bert-
with-gpus/](https://devblogs.nvidia.com/training-bert-with-gpus/), who made
GPT-2 8B, which is ~5 times as large as the full 1.5B GPT-2.

------
wyldfire
Are there any applications for the GPT-2 models beyond text synthesis?
Inference, question answering, named-entity recognition/disambiguation,
anything like this?

~~~
make3
BERT and its descendants do better at all of this, and are the industry
standard now
[https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)
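
For example, with the pytorch-transformers library, getting the per-token
features that QA/NER heads are fine-tuned on takes a few lines (rough sketch):

    import torch
    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    ids = torch.tensor([tokenizer.encode("BERT is the industry standard.")])
    with torch.no_grad():
        hidden = model(ids)[0]         # one 768-dim vector per token
    print(hidden.shape)                # torch.Size([1, seq_len, 768])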

~~~
The_rationalist
Except that BERT is now obsoleted by
[https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet) (but
XLNet would never have existed without BERT).

~~~
ivalm
Kind of. There are a bunch of transformers that might perform better than
BERT (ERNIE 2.0 being stronger than XLNet, for example), but often this is a
function of training-data size (XLNet trained on 10x more data than the
original BERT). Realistically, there are now released BERTs finetuned for
special corpora (BioBERT, Clinical BERT, etc.), so if you want to work on
those kinds of texts you are better off starting with a BERT that was
previously fine-tuned on something close to your task (and then fine-tuning it
more yourself).

~~~
The_rationalist
Well, your comment was really interesting to me, because I didn't know ERNIE
2.0, and its concept of continual learning seems to be a real step forward!

But some of your statements seem incorrect: _ERNIE 2.0 being stronger than
XLNet_ XLNet is the neural net with the biggest number of first places on
benchmark leaderboards. Cf: [https://paperswithcode.com/paper/xlnet-
generalized-autoregre...](https://paperswithcode.com/paper/xlnet-generalized-
autoregressive-pretraining) ERNIE 2.0, meanwhile, currently has zero first
places on paperswithcode.com:
[https://paperswithcode.com/paper/ernie-20-a-continual-pre-
tr...](https://paperswithcode.com/paper/ernie-20-a-continual-pre-training-
framework)

 _XLNet trained on 10x more data than the original BERT_ No, I've read in a
GitHub issue on XLNet that XLNet base is the same size as BERT base and XLNet
large is the same size as BERT large. (I don't know about ERNIE 2.)

Your point on finetuned BERT vs non-finetuned XLNet is interesting. RoBERTa is
so fine-tuned it beats XLNet on some tasks. But generally non-finetuned XLNet
beats finetuned BERT, and there are more and more finetuned XLNets each week.
Still, your point does apply for RoBERTa, and for the few tasks where BERT has
been applied but XLNet hasn't yet.

~~~
ivalm
> XLNet trained on 10x more data than the original BERT No, I've read in a
> GitHub issue on XLNet that XLNet base is the same size as BERT base and
> XLNet large is the same size as BERT large. (I don't know about ERNIE 2.)

It's not about the size of the model, but the training data. If you read the
XLNet paper
[https://arxiv.org/pdf/1906.08237.pdf](https://arxiv.org/pdf/1906.08237.pdf)
they clearly state in section 3.1:

"Following BERT [10], we use the BooksCorpus [41] and English Wikipedia as
part of our pretraining data, which have 13GB plain text combined. In
addition, we include Giga5 (16GB text) [23], ClueWeb 2012-B (extended from
[5]), and Common Crawl [6] for pretraining. We use heuristics to aggressively
filter out short or low-quality articles for ClueWeb 2012-B and Common Crawl,
which results in 19GB and 78GB text respectively. After tokenization with
SentencePiece [16], we obtain 2.78B, 1.09B, 4.75B, 4.30B, and 19.97B subword
pieces for Wikipedia, BooksCorpus, Giga5, ClueWeb, and Common Crawl
respectively, which are 32.89B in total"

If you compare with the BERT paper
[https://arxiv.org/pdf/1810.04805.pdf](https://arxiv.org/pdf/1810.04805.pdf),
the pre-training data description is also in section 3.1:

"Pre-training data The pre-training procedure largely follows the existing
literature on language model pre-training. For the pre-training corpus we use
the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M
words). For Wikipedia we extract only the text passages and ignore lists,
tables, and headers. It is critical to use a document-level corpus rather than
a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et
al., 2013) in order to extract long contiguous sequences."

So 32.89B subword pieces for XLNet vs roughly 3.3B words for BERT: about 10x
the data.

We've also run fine-tuning experiments supplementing with an additional
private medical corpus (~10B words), and felt starting from Clinical BERT was
better than XLNet (for our rather specific use cases).

------
brentsch
I'm curious about the "fine-tuning based detection" mentioned in the report
("Fine-tunes a language model to 'detect itself'... over a range of available
settings"). Does anyone know good articles/papers (or have an off-the-top
tl;dr) to get a high-level grasp of "self-detection" for generative models?

~~~
mappingbabeljc
Hiya, I work at OpenAI. I think the Grover paper is a good place to read about
some of this:
[https://arxiv.org/abs/1905.12616](https://arxiv.org/abs/1905.12616)
We're likely publishing more on detecting fine-tuned outputs in the future,
also.

~~~
brentsch
Many thanks! Looking forward to reading the OpenAI research when it comes out
as well.

------
lxe
Anyone wired a "talktotransformer"-style system to this one yet? Would like to
see how it works without going through the steps of setting it up.

EDIT: Looks like
[https://talktotransformer.com/](https://talktotransformer.com/) already uses
the 774M one!

