
Train a GPT-2-1.5b Text-Generating Model with GPU for Free - tosh
https://colab.research.google.com/drive/1QE4LVEYITjIkjXxosahHVZPsSHtYZy7x
======
nonbirithm
I spent a week trying to do this by hunting down disparate blog posts that all
used a different ML framework or GPT2 finetuning tool. I didn't understand
what you were supposed to do and ended up with gibberish output instead. For
example, isn't the dataset supposed to be specifically formatted to separate
significant blocks of text, with something like <|endoftext|>? The OP doesn't
mention anything about this.

It's these kinds of things that make me doubt myself at every turn when doing
this "finetuning" thing. I wish there was a clearer tutorial for people only
interested in retraining GPT2 to generate different kinds of text, alongside a
description of how each step works at an intuitive level. A lot of times the
author starts right off with "okay let's start finetuning" without really
defining what "finetuning" is or does. I just want to understand the scope of
knowledge needed to use my own text corpus as the input source, without having
to become an expert in ML. (Maybe this activity isn't for me
if that's how I'm going to think about it?)

If anything, I wish Talk to Transformer still existed as a downloadable model.
I have yet to find another model that generates text like it did. At the end
of the day I just sigh and look at all the random article, comment, and
adventure generators other people have succeeded in making without "getting"
it.

~~~
serendipityrecs
Hi, I don't have a comprehensive guide or anything like that, but I can
quickly answer your questions:

1. Delimiting your text with <|endoftext|> lets the code know where the
boundaries of your text chunks are. It makes sure that during training you're
not inadvertently continuing past the end of one chunk of text into a totally
unrelated one (there's a short sketch of this after point 2). However, it
seems there's a bug in how the nsheppard fork handles this:
[https://github.com/openai/gpt-2/issues/222](https://github.com/openai/gpt-2/issues/222)

Even if you're not doing this, you shouldn't be getting gibberish, so you
likely have another issue in your setup.

2. Finetuning just means training further on top of a model that's already
been trained. GPT-2 is a language model, which means it's trying to learn (and
generate from) a probability distribution over words (actually tokens). That
probability distribution is going to be different depending on what corpus
you're working with (Wikipedia articles vs news vs reddit). GPT-2 (the models
you download) is trained on a particular corpus, but as you fine-tune on your
own corpus it's going to start pulling the learned probability distribution
toward that of your dataset.
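To make point 1 concrete, here's a minimal sketch of preparing a delimited
training file. The directory name, file layout, and chunking are placeholder
assumptions for whatever your corpus actually looks like:

    # Hypothetical prep step: join separate text chunks into one training file,
    # with <|endoftext|> marking the boundary between unrelated chunks.
    from pathlib import Path

    chunks = [p.read_text(encoding="utf-8").strip()
              for p in sorted(Path("my_corpus").glob("*.txt"))]

    with open("train.txt", "w", encoding="utf-8") as f:
        f.write("\n<|endoftext|>\n".join(chunks))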

I think the best way to play with this stuff is to run the code, like you're
doing. When you get something that doesn't make sense, dig into the
code/paper. There's also a lot of information in the readmes/issues for these
repos.
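If it helps, here's roughly what an end-to-end fine-tuning run looks like with
minimaxir's gpt-2-simple package (not necessarily what the linked notebook
uses; the model size, step count, and dataset path are just example values):

    # Sketch of a fine-tuning run with gpt-2-simple (pip install gpt-2-simple).
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="124M")   # fetch the pretrained weights

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset="train.txt",      # the <|endoftext|>-delimited file from above
                  model_name="124M",
                  steps=500)                # keep training on your own corpus

    gpt2.generate(sess, length=200)         # sample from the fine-tuned model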

~~~
nonbirithm
Thank you for the pointers, I'll rethink my strategy.

------
lnyan
The subdomain behind the title is kind of misleading.

To be specific, _colab.research.google.com_ and _research.google.com_ are
different sites, but the fourth-level domain is omitted in the domain field on
HN.

------
lowdose
Meh, I thought Google was always on the cutting edge of ML. They just tried to
pimp up a two-year-old dysfunctional show pony and sell it as a new racehorse.
I would like to see the trained GPT-3 that's currently up for rent from the AI
shop delivering actual innovation.

~~~
lawrenceyan
GPT-3 currently costs an estimated $4.6 million to train[0]. Not only is your
comment unnecessarily negative, it's just generally a pretty bad take.

[0]: [https://lambdalabs.com/blog/demystifying-gpt-3/](https://lambdalabs.com/blog/demystifying-gpt-3/)

~~~
lowdose
What value is added by training a model yourself? Why don't they provide that
model?

Isn't the real innovation here to give people a $4.6 million model for free?

Instead of letting people burn unnecessary compute power to train a model two
years behind the state of the art.

~~~
lacker
The main value is that you can choose your own training data. For example, if
you wanted to train a text-generation model on purely C++ code, this would be
a good way to do it.
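For instance (just a hypothetical sketch, echoing the dataset prep earlier in
the thread), building a C++-only corpus could be as simple as walking a source
tree and concatenating the files:

    # Hypothetical: collect C++ sources into one <|endoftext|>-delimited corpus.
    from pathlib import Path

    cpp_files = [p for p in sorted(Path("my_cpp_project").rglob("*"))
                 if p.is_file() and p.suffix in {".cpp", ".cc", ".h", ".hpp"}]

    with open("cpp_train.txt", "w", encoding="utf-8") as out:
        out.write("\n<|endoftext|>\n".join(
            p.read_text(encoding="utf-8", errors="ignore") for p in cpp_files))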

