
OpenGPT-2: We Replicated GPT-2 Because You Can Too - programd
https://medium.com/@vanya_cohen/opengpt-2-we-replicated-gpt-2-because-you-can-too-45e34e6d36dc
======
steve19
It sounds like all the drama OpenAI made about not releasing the model was all
just marketing. $50,000 is nothing for a nation-state or even just a motivated
third party. I had always assumed OpenAI had spent well into the 6 or even 7
figures to train the full model.

MSFT has sort of invested $1 billion into OpenAI so I guess it worked!

~~~
anchpop
Before they release it, they need to prove to themselves it's safe (I'm
speaking morally). I'm not sure why so many on HN seem to think it's the other
way around, that you have to prove it's dangerous before withholding it. $500k
for reproduction is peanuts for a nation state but quite a bit for many other
groups who might be interested.

Even if it is entirely safe, it's still good to withhold it because it helps
create a culture where people think about the safety of their projects before
releasing them

~~~
Bartweiss
There are two very different questions here: is it good to withhold, and is
that sensible for _OpenAI_ to withhold? I think a lot of people on HN view
restricting GPT as incoherent by OpenAI, where they might not for another
group.

OpenAI's _raison d'être_ was democratizing access to AI tools, and preventing
them from being abused by concentrated powers like governments. If replicating
GPT-2 is trivial for state actors but prohibitive for hobbyists and other
private citizens, it's creating the same issue Musk described OpenAI as
setting out to oppose. Even the general idea of treating AI tools as
hazardous-by-default goes a long way to validating the project's original
critics.

------
anon1253
"The cost of training the model from scratch using our code is about $50k."

Still a steep cost for a bootstrapping startup. It's something
I continually run into myself. I have somewhat of a weekend project trying to
build a search engine but man ... the cost of just the SSDs and GPUs is
daunting on a regular salary. As the complexity of these models grows, so does
the barrier to entry for a regular joe like me; which is a shame I think. I
know in the US it's fairly normal for a data scientist to pull 100k+ / year,
but in the Netherlands salaries pretty much stall at 40k (and angel investment
in IT/AI is at an all time low). More generally I fear this will become a bit
of a sociotechnical issue if complex AI models will be out of reach for entire
economies (especially for cases like language because not everyone speaks
English and "minor" languages like those in EU countries are a massive market
to explore, yet hard to get into).

~~~
tzapzoor
And they're just "two masters students, with no prior experience in language
modeling" with $50k lying around for training a huge model.

~~~
p1esk
They mentioned they spent $500k (in research credits) on all the experiments
to actually find the hyperparameters.

~~~
godelski
Where did you see that?

Also how do two masters students with no experience in NLP get $50-500k in
compute credits? How do I get that deal?

~~~
p1esk
It was in the article last night, they deleted it for some reason after my
comment.

One of the authors has a peer reviewed NLP publication [1], the other has
several publications in computer vision. I don’t know how they got research
credits from Google.

[1] [https://arxiv.org/abs/1905.13153](https://arxiv.org/abs/1905.13153)

~~~
godelski
Well, that is pretty dubious. It completely changes the calculus of how much
this costs and how difficult it is. (And I definitely believe you, since
several other comments cite that same number.) Could it be a typo? But that
seems like something you'd catch. Then again, these are masters students.

~~~
p1esk
It's $50k per _training run_. Their main contribution has been finding the
optimal hyperparameters, which weren't described in the OpenAI paper.
Obviously you need more than one training run to do that.

------
high_derivative
It may be high time to discuss what AI policy has actually done so far. From
what I can tell, not much other than letting social scientists get in on the
deep learning gravy train.

Meanwhile, misuses of ML are proliferating without limits, and 'AI policy' is
apparently mostly used as a fig-leaf to collect good-will, marketing, and buy
a seat at the table for future regulations. As usual, regulations will protect
incumbents, so my as-usual cynical read is that OpenAI's policy interests are
about protecting its own future interests. From that perspective, the entire
GPT-2 stunt was highly effective.

Now depending on your outlook, that may be an argument that we need more
people in policy, or fewer. Or different ones.

~~~
repolfx
_Meanwhile, misuses of ML are proliferating without limits_

Are they? Where?

For all the hype I haven't seen any obvious abuses of AI. I've seen better
speech recognition and a few other useful things. I've seen stuff that's wrong
but ultimately kind of trivial like auto-generated porn with celebrity faces,
but that's the worst stuff so far.

I haven't seen clear, unambiguous cases of abuse beyond that. I've seen a lot
of _allegations_ that AI is being abused e.g. "Russian bots" but on
investigation these stories usually evaporate.

If anything I've been kind of disappointed by AI so far. Amazing demo videos
abound, but I'd guess 90% of the impact of AI in my life has been Google
improving their already quite good services. Better translations, better
search results etc. All very welcome but not really life changing.

~~~
visarga
> better search results

Funny you should say that. I was recently searching for an e-scooter lighter
than 10 kg, and all Google could find was the maximum allowed weight of the
person riding the thing (around 100-130 kg). Not to mention that it didn't
understand how to handle a conjunction and show me light e-scooters with
suspension. It's just matching keywords without understanding anything about
the relations between them.

I am disappointed in the current quality of Google search, especially for
shopping-related queries where there is money to be made. Instead of stuffing
the pages I visit with irrelevant 'personally' targeted ads and tracking my
every move, they should make an effort in that particular moment when I
actually want to buy something and give me a good suggestion.

~~~
derefr
Coincidentally, this kind of "modelling the question" is what IBM's Watson
is/was supposed to be uniquely good at. It seems like Google hasn't even
really considered entering the same space. Maybe each query has too high an
incremental cost to run for them to be profitable right now?

------
6gvONxR4sf7o
It's worth noting that, like other attempted replications, the perplexities of
this model mostly aren't as good as GPT-2. Given that the title of the GPT-2
paper was "Language Models are Unsupervised Multitask Learners," I'd be
interested in a lot more metrics before I'd believe GPT-2 has actually been
replicated. Especially because every other time someone says this, metrics
show otherwise. Until then, this is just a really big model.

~~~
Smerity
I made the WikiText-2 and WikiText-103 datasets they compare against and held
state of the art results on language modeling over PTB, WT-2, and WT-103 not
too long ago.

OpenGPT-2's results are near equal to GPT-2's in zero-shot language model
perplexity on multiple datasets [1].

The zero-shot perplexity results are also exactly where we'd expect the 1.5
billion parameter model to be: markedly better than OpenAI's 775M GPT-2
model [2] (the second-largest model OpenAI trained), which they released in
the last few days.

To me this is about as close a replication as you could expect, especially
given OpenAI didn't release many of the exact training details. If OpenAI
retrained the 1.5 billion parameter GPT-2 model I wouldn't be surprised to see
the same variance in performance simply due to the random initialization of
the parameters.

[1]: [https://miro.medium.com/max/3200/1*h1JoiQq9f1qOHS-rN4u57A.png](https://miro.medium.com/max/3200/1*h1JoiQq9f1qOHS-rN4u57A.png)

[2]: [https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe/figure/3](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe/figure/3)

~~~
6gvONxR4sf7o
That's true, but isn't the point of GPT-2 that it's strong at many tasks? It
did really well at a lot more than just the four perplexity measures reported
in OP.

------
Felz
What's it take to actually run a model like this, hardware-wise? I've been
toying around with a gpt2 discord bot
([https://github.com/ScottPeterJohnson/gpt2-discord](https://github.com/ScottPeterJohnson/gpt2-discord))
using just a CPU calculation, and already it takes up 2 GB RAM (and is slow
obviously) on the 345M model. I might be able to get the 774M model running,
but there's no way I can afford the full model, assuming linear RAM use. And
that's just for CPU compute, I can't even begin to imagine how expensive GPU
would be.
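
As a rough back-of-envelope (assuming fp32 weights and ignoring activation
and framework overhead, so real usage runs higher), the floor on RAM is just
parameter count times bytes per parameter:

```python
def param_memory_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Lower bound on RAM needed just to hold the weights in fp32.

    Real usage is higher: activations, framework overhead, and any
    cached state during generation all add on top of this floor.
    """
    return n_params * bytes_per_param / 1024 ** 3

# Back-of-envelope figures for the GPT-2 sizes discussed in the thread
for n_params in (345_000_000, 774_000_000, 1_500_000_000):
    print(f"{n_params / 1e6:.0f}M params -> ~{param_memory_gb(n_params):.1f} GB fp32")
```

That roughly matches the ~2 GB observed for the 345M model once overhead is
added, and suggests the full 1.5B model needs around 6 GB for the weights
alone.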

~~~
ageitgey
Inference on this model works fine on Google Colab, which gives you a Tesla
K80 GPU with access to 12GB of GPU RAM. You can buy a used K80 for probably
about $850, but it's not really ideal for putting in a home computer because
of the cooling requirements.

[ deleted reference to 2070 Super ]

~~~
p1esk
A used K80 can be had for $350 [1]. Not bad, actually (it's probably as fast
as a 1080 Ti, and it has 24GB of memory).

[https://www.ebay.com/itm/NVIDIA-Tesla-K80-GDDR5-24GB-CUDA-PCI-e-GPU-Computing-Accelerator-Card/193033951488](https://www.ebay.com/itm/NVIDIA-Tesla-K80-GDDR5-24GB-CUDA-PCI-e-GPU-Computing-Accelerator-Card/193033951488)

~~~
happycube
K80 is 2 GPU chips with 12GB, so it's not always as good as one newer/larger
GPU. Much more affordable though :)

~~~
p1esk
If I remember correctly K80 memory is actually 24GB, not 2x12GB. This is a
pretty important distinction in this context (training GPT-2).

Also, you can get at least 6 K80s for the price of a single RTX Titan (also
24GB). So it would be faster (I don't think RTX Titan is 6x faster than K80)
and 6x more memory for the same price. It's a very good deal.

------
minimaxir
Twitter thread by a Research Scientist at OpenAI addressing OpenAI's policies
in response to this discussion here:
[https://twitter.com/Miles_Brundage/status/116495932263331840...](https://twitter.com/Miles_Brundage/status/1164959322633318400)

------
exabrial
Without context, this article reads like something generated by machine
learning.

------
p1esk
They spent $500k replicating it. But sure, you can do it too /s

~~~
gwern
They used research credits, and even that aside, with their code and training
tips, you can redo it for $50k on cloud instances or less on dedicated
hardware + patience. And look at ImageNet training progress: you can train a
near-SOTA ImageNet CNN in like a minute for $20-40 after a lot of optimization
work. We've already seen a lot of improvements in LMs over the past 2 years...
(For example, the main barrier to training GPT-2 is just the bloody memory use
from the Transformers exploding at runtime, which pushes you into high-end
hardware like cloud TPUs on GCP. Do Sparse Transformers fix that?)
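
The "memory exploding at runtime" is mostly the self-attention score
matrices: a naive implementation materializes an L x L matrix per head per
layer. A quick sketch, using the published GPT-2 1.5B shape of 48 layers and
25 heads (the batch size here is an arbitrary assumption):

```python
def attention_scores_gb(seq_len: int, n_layers: int, n_heads: int,
                        batch: int, bytes_per: int = 4) -> float:
    """Memory for the L x L attention score matrices alone, assuming a
    naive implementation that materializes all of them at fp32."""
    return batch * n_layers * n_heads * seq_len ** 2 * bytes_per / 1024 ** 3

# GPT-2 1.5B shape: 48 layers, 25 heads, 1024-token context
print(attention_scores_gb(seq_len=1024, n_layers=48, n_heads=25, batch=8))  # -> 37.5
```

Sparse Transformers attack exactly that quadratic term, replacing the dense
L^2 attention pattern with a sparser one, which is why they're a plausible
fix for the memory barrier.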

~~~
p1esk
Wait, how can I get to near SOTA on Imagenet in a minute (!) for $40?

~~~
gwern
OK, I exaggerated a little because I was recalling from memory: the old
fast.ai approach actually takes <18 minutes
([https://www.fast.ai/2018/08/10/fastai-diu-imagenet/](https://www.fast.ai/2018/08/10/fastai-diu-imagenet/)). My bad. (I'm
sure it's improved since then but I don't know how much.) I was also thinking
of [https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/](https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/), which
does CIFAR-10 in 26s, but I'm not sure offhand what CIFAR-10's SOTAs look
like, so I'm not sure how far away that is.

~~~
p1esk
Actually this is still very good. Thanks for the links. I'll be timing some of
these tricks tomorrow for my Imagenet experiments. By the way, I believe this
is the current SOTA for Imagenet:
[https://arxiv.org/abs/1905.11946](https://arxiv.org/abs/1905.11946) (84/97%).
CIFAR10 appears to be essentially solved (99%).

------
macawfish
It's fun to feed it the first paragraph of the Sokal affair paper:

Prompt:

 _There are many natural scientists, and especially physicists, who continue
to reject the notion that the disciplines concerned with social and cultural
criticism can have anything to contribute, except perhaps peripherally, to
their research. Still less are they receptive to the idea that the very
foundations of their worldview must be revised or rebuilt in the light of such
criticism. Rather, they cling to the dogma imposed by the long post-
Enlightenment hegemony over the Western intellectual outlook, which can be
summarized briefly as follows: that there exists an external world, whose
properties are independent of any individual human being and indeed of
humanity as a whole; that these properties are encoded in ``eternal''
physical laws; and that human beings can obtain reliable, albeit imperfect and
tentative, knowledge of these laws by hewing to the ``objective'' procedures
and epistemological strictures prescribed by the (so-called) scientific
method._

Completion:

 _Before one can explore these ``objective'' principles and procedures, one
must do something important, namely, invent the objectivity -- or, more
precisely, the objectivity-transcendent properties -- of the ``subjective''
functions of ``man" (as a species of ''objective'', or something-in-itself-
independent, human, being). Since the 19th century, this has been done on a
literal or phenomenological level, the metanarrative of psychoanalysis.
However, the ultimate test of what constitutes an objective fact or structure
is the statistical discovery of reliable facts or structures residing in the
human mind or body. That is what the discipline of psychology is for. The
distinction between the subjective and the objective is therefore necessary
for the discipline of psychological science, which is by no means a waste of
time, which is today a flourishing field of scientific inquiry, and which
might contribute a great deal toward furnishing a genuinely complete
theoretical understanding of human behavior and psychical experience.

In short, the philosophical implications of the post-Enlightenment philosophy
of the sciences and of psychical psychology are -- when considered in the
context of both human existence and human thought -- deeply unsatisfactory.
Once we abandon the most basic postulate of scientific rationality -- that
subjective knowledge and belief and behavior are not to be said to be existent
but to be, not to be, knowable -- there is nothing left to suggest that the
content of human knowledge and experience is dependent upon our conscious
perceptions or personalities. We might describe that content as ineffable (it
is irreducible to language, outside the experience of individual thinkers), or
inconceivable (that which is not one's own consciousness, but is beyond the
realm of conscious experience). In either case, there is no genuine need for
an objective world, or, at least, no need for an objective foundation. In that
sense, it is erroneous to assert that the scientific standpoint is a paradigm
of objective reality; rather, it is the antidote to the paradigm of
irreducible subjectivity and self-consciousness.

Unsupported by this analysis of the problem of scientific knowledge, the field
of psychical science is often rejected by younger scholars, for whom the study
of the human condition and its ineffable psychology is regarded as a
particularly enticing but very unapproachable form of mysticism. They regard
this section of the theory and practice of modern scientific inquiry as a kind
of metaphysical theory-cracking rather than as a vehicle for the discovery of
objective facts. Such a view can only be sustained on the basis of the
strictest adherence to the doctrines of empiricism and fundamental
materialism, respectively, and it is by no means fruitful for the real
progress of science. Let me explain.

The residual emphasis on irreducible subjectivity in modern Western thought
(and only by the residual emphasis) is a legacy of the metaphysical concept of
objective reality which, to my mind, is still embraced by the cognitivist
movement of the 20th century. It is an unfortunate legacy. Irreducible
subjectivity and subjectivity-dependent, subjective perceptions were at the
root of the metaphysics of Buddhism, and these beliefs now constitute the
dominant conception of reality in contemporary Western philosophy,
particularly Kantian, Descartesian, and Hegelian thought, which are each
deeply indebted to Aristotelian theories of subjectivity.

While both these views of reality are patently absurd, and since the
scientific revolutionary of the middle 19th century was able to disregard or
repudiate them, the followers of Kant and Descartes were able to maintain that
there is no need for a foundation for the science of knowledge. They could do
this because they held to a primitive, problematic conception of objectivity,
based on the notion of an objective, external world, in which human
consciousness, thus independent of any particular body, mind, or culture, was
inchoate, mutable, and subject to change or speculation. There was therefore
no need to search for a theory of experience. Science and experience were
simply different approaches, of which each was as good as the other, and they
both..._

------
macawfish
Prompt:

 _The jig is up. And what now? Where do we go from here?_

Completion:

 _Where Do We Go From Here: In the aftermath of the fall of the German
Republican party, we now have a significant degree of instability across the
earth’s systems of government and finance. The almost complete collapse of
systemic forces in the Eurozone and limited success at stabilizing the system
means the question is not if but when, what do we do next?

The answer is simple. We must move beyond the localized, bubble-like, and
short-termist “get involved,” tactic of getting into the scene and trying to
control it in some way. We have to come up with a way of shifting the socio-
political power in the world, the prime place for transformation is worldwide
at the supra-system level and not just the economy and finance. We must cast
out the old dominated system, of which we have been just a part and recognize
that we need a new dominant system that serves human interests, and the meta-
level global system must serve human interests.

The fact that the global status quo is collapsing of its own weight shows us
that the system is structured in such a way that the group of big players who
have dominated and still dominate, are in ever-decreasing danger of losing
both power and integrity.

The question now is, how? How do we avoid degeneration into chaos and conflict
when the anarchic nature of the system leads inevitably to greater and greater
competition among and frustration and anger in the younger generations?

This is a society, this is a planet and we live in the first global century of
human history, which the young will pass from generation to generation in the
next twenty years, or perhaps not. When we see the events of the last weeks
and days, you can just imagine what will happen to this planet, to this planet
and human society in the century ahead, and you can just imagine what the
future will bring to this and subsequent generations.

When history describes the past, it sees the collapse of an old political
establishment, of the traditional hierarchies of power, of society, of
economics and finance. It sees a collapse in the old order of power and in the
equilibrium it has created, which was grounded in constant growing jobs and
the prosperity it produced. We are in the middle of a permanent expansion of
capitalism, which also creates ever-growing wealth and prosperity for a small
population of wealthy earners, while social polarization and inequality
increase and older people depend on each other more and more desperately.

The forward and downward momentum of all these forces has created a situation
in which there is almost no limit to the volume of the day to day, or minute
to minute production and consumption, and in which there is no single concern
about the future of the planet Earth. We have become so insatiable, in need,
addicted to this ever increasing appetite for consumer goods, that we destroy
the planet with it.

You see this just from what we feed our children, the choices we make, and the
products we consume. You see it in our greedy attempts to buy as much as we
can, even if it leads to ecological ruin. You see it in our drive to consume
new and ever more lavish luxury products, materials, tools, devices,
insatiable lifestyles, modern-day imperialism, racism, cynicism, competition,
greed, consumerism, hubris, and endless pursuit of personal ambitions and
leisure.

See how when push comes to shove, the social and economic growth created by
the continued expansion of capitalism is now a life or death matter. See how
the political establishment has failed us, all of us, and how we turned in
desperation to another self-serving-self-protective-petty, self-interested-
philistine mass-mediator, in the form of Mr. Romney, in order to maintain the
old sources of power, to make the old social structures fit to serve human
needs and the system could be kept going.

And now he has packed his bags and wants to leave, so there we are, stuck here
with those of us who have found a way to provide for ourselves and live
peacefully and prosperously, without the brutality and violence visited on us
by politicians and corrupted systems. That is, unless we fix these broken
systems and deliver an alternative based on human needs and human compassion.

How do we do it? How do we get there? Stay tuned and we’ll let you
know.<|endoftext|>This exciting book is an overview of a phenomenon that
started in the 1970’s, and became the most spectacular of all the urban myths.
It combines all things the paranormal in this feature length book, from
scientists to aliens to experimental reports.

William Kean is an astronomer working for NASA. One evening he is on his way
to a remote overlook on a Martian hill. Suddenly, he is teleported to the top
of a fifty story building, two thousand feet in the air. The building_

~~~
repolfx
The sudden lurch into paranormal book review at the end is interesting. I
guess it's because it went down the path of _"How do we do it? How do we get
there? Stay tuned and we’ll let you know."_ ... that last sentence is probably
very likely to occur in conspiracy theory/paranormal/UFO texts.

It's very interesting that the model generated a basically coherent speech
that could have come from any left-wing event or politician, given nothing
more than "things are bad, what next" as a starting point. GPT-2 has correctly
learned that Marxist thought is based on a form of catastrophism, as anyone
who has read Marx will confirm.

It's going to be fascinating to see how people use this. My guess is "that
sounds like an AI wrote it" will become an insult meaning predictable and
content-free.

Even more fun will be putting the model into reverse and calculating a
predictability score - if given the starting point of a real human written
speech, GPT-2 rates each next word as highly likely, the overall speech can be
said to be only N% insightful, where N is an actual scientifically defined
measurement.
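
That "model in reverse" idea is essentially per-token likelihood scoring. A
minimal sketch, with a toy bigram model standing in for GPT-2 (the real thing
would use GPT-2's own token probabilities, and the 0.3 threshold is an
arbitrary assumption):

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Toy stand-in for GPT-2: an add-one-smoothed bigram model.
    Returns a function prob(prev, cur) = p(cur | prev)."""
    vocab = set(tokens)
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1

    def prob(prev, cur):
        total = sum(counts[prev].values())
        return (counts[prev][cur] + 1) / (total + len(vocab))

    return prob

def predictability(prob, tokens, threshold=0.3):
    """Fraction of tokens the model rated 'highly likely' -- one crude
    way to turn per-token likelihoods into an N% predictability score."""
    pairs = list(zip(tokens, tokens[1:]))
    likely = sum(1 for prev, cur in pairs if prob(prev, cur) >= threshold)
    return likely / max(1, len(pairs))

corpus = "the cat sat on the mat the cat sat on the hat".split()
prob = train_bigram(corpus)
print(predictability(prob, corpus))
```

Swapping in a real language model only changes where `prob` comes from; the
scoring logic stays the same.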

Many people seem to adopt dystopian catastrophism about AI but I feel somewhat
optimistic. In the same way that automated spelling and grammar checkers can
help people write better, a GPT-2 run in reverse could help people write
clearer prose that gets to the point quicker, or perhaps even force people to
accept when they don't really have anything new to say. If a speaker doesn't
use it then someone in their audience will, after all.

~~~
macawfish
Notice that "<|endoftext|>" delimiter. A lot of the samples I generated had
that, and would then rapidly switch into a whole different tone or style.
Maybe there was an error in their training where they somehow didn't separate
training samples properly? I don't know enough about machine learning to say.

I also find it interesting that this sample got -4 points where the Sokal
affair sample I posted got +4 points.

I imagine it has more to do with the emotions each sample evokes in various
hackernews readers. Could it be that hackernews readers are likely to have a
distaste for postcolonialism, but are likely to be fans of materialist
rationalism? I think so, based on years of reading their comments :)

~~~
AdamDKing
On the <|endoftext|>: GPT-2 and this model were trained by sampling fixed-
length segments of text from a set of web pages. So if the sample happens to
start near the end of one page then it will fill in the rest of the length
with the beginning of another page. The model learns to do the same.
TalkToTransformer.com hides this by not showing what comes after the
<|endoftext|> token.
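
That sampling scheme is easy to sketch in a few lines (a hypothetical helper,
not the actual training code; real pipelines work on BPE token IDs rather
than word strings):

```python
EOT = "<|endoftext|>"

def pack_segments(documents, seg_len):
    """Concatenate documents with an end-of-text marker, then slice the
    stream into fixed-length training segments. A segment that starts
    near the end of one document simply runs on into the next -- which
    is why generations can 'jump' to a new page mid-sample."""
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(EOT)
    return [stream[i:i + seg_len]
            for i in range(0, len(stream) - seg_len + 1, seg_len)]

# The second segment below crosses a document boundary:
print(pack_segments([["a", "b", "c"], ["d", "e", "f"]], seg_len=3))
```

The model sees text after `<|endoftext|>` during training, so it learns to
continue past it with an unrelated "page," exactly as observed above.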

~~~
macawfish
That explains why sometimes the talktotransformer samples are so short!

------
Chris2048
I skimmed the article and started reading the last paragraph to get an idea
of what it was about. I was very confused..

