
Talking to myself: how I trained GPT2-1.5b for rubber ducking using my chat data - Tenoke
http://www.svilentodorov.xyz/blog/gpt-15b-chat-finetune/
======
andai
My friend and I both trained GPT2 on our chat logs. It's mostly just hilarious
seeing what comes out of it, but I've actually gotten real insight out of
"hearing myself talk" \-- it's similar _enough_ to my personality that it
shows me my interests, bad habits etc. And we can ask each other questions, or
write the first half of an answer and see what comes out. It can be pretty
weird, but we've actually gotten some great advice out of it too. (When you
train it on your own text, it still keeps its "wisdom" from the original
model.)

If anyone wants to try, I used this colab thing (I don't even need a GPU!
Blows my mind that this is free)

[https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce)

If you use Colab it uploads your data to Google's servers. In this case, they
already had our chats (WhatsApp backup to Drive).

~~~
prophesi
I tried asking this in the Show HN thread on that exact colab project, but how
difficult would it be to set it up in your own local Jupyter notebook if
you're okay using your own GPU?

Edit: Ah, I see in another thread
([https://news.ycombinator.com/item?id=22129978](https://news.ycombinator.com/item?id=22129978))
that your GPU needs 11GB+ of VRAM to train the model, which my 1080 certainly
doesn't have. A friend of mine works at [https://spell.run](https://spell.run),
which offers free trials for anyone interested in an alternative to Google. I
may give it a shot this weekend.

~~~
andai
[https://www.gwern.net/GPT-2#training](https://www.gwern.net/GPT-2#training)

My friend said he got it running on 8GB VRAM. But the first time he ran it, I
think it wasn't even using his GPU (it took days instead of hours to train
though).

------
blazespin
I would be curious to know, when we write, how much of it is self-attention
and how much is our forebrain actually trying to make sense. My guess is that
the more tired / rushed / burned out you are, the higher the share of
self-attention gets.

Sometimes watching the news, it seems like 90% of what they say when they are
'vamping' is just self-attention.

Has anyone posted any GPT / Hacker News generated text yet? Wisdom of the
crowds, indeed. It'd be interesting to post using it with light editing,
especially something that uses upvotes for training.

One of the things I was thinking about was training on your favorite novel, so
you could have a sort of conversation with it / ask it questions. A kind of
interactive CliffsNotes. However, as I looked into it I realized it was still
too much of a Markov-chain-like thing to be functionally useful. Fun idea
though.

The real win in all of this, of course, is autocompletion in different
mediums. Code completion demos are pretty wild -
[https://tabnine.com/blog/deep/](https://tabnine.com/blog/deep/) Come to think
of it, you could probably use it for writing academic papers as well,
assuming you know the content well.

Self-Attention and Human/Computer interaction is a very brave new world. I
don't think people really yet know the potential for seismic shift here.

~~~
leod
I've trained a Transformer encoder-decoder model (this was slightly before
GPT2 came out) to generate HN comments from titles. There is a demo running at
[https://hncynic.leod.org](https://hncynic.leod.org)

~~~
CDSlice
It doesn't seem very accurate; there isn't close to enough Electron hate
whenever it's in the title.

This is pure gold though:

> How does one make a web app using a standard framework? I've never used it,
> but it sounds like someone has been able to put together something like a
> Web app with only one app.

Edit: This is even better.

> Rewriting a Linux kernel in Rust, by hand, is definitely the right thing to
> do as a beginner/intermediate programmer.

~~~
jandrese
> Rewriting a Linux kernel in Rust, by hand, is definitely the right thing to
> do as a beginner/intermediate programmer.

Absolute perfection.

------
unnouinceput
Quote: "The conversations aren’t ideal ..."

Hi Tenoke, you got it wrong. It will never be ideal, no matter what. And I
think the opposite: those examples are actually quite ideal. You see yourself
from a different perspective, in the same way everybody reacts to hearing
their own voice - you sound different to yourself than what the people around
you hear. You just "heard" your AI, as crude as you think it is, for the
first time. Thank you for this; don't mind if I grab everything you did and
do it for myself as well. This is going to be fun!

------
fredley
Straight out of the _Black Mirror_ episode _Be Right Back_ [0], which is 7
years old.

[SPOILERS] In the episode, the main character uses a service to reconstruct a
chat bot (and eventually a lifelike avatar) built from her dead partner's
social media history. Eventually she becomes frustrated by its lack of depth
(since it's only trained on social media data, it falls into a sort of
uncanny valley of comprehension and personality), but can't part with it,
confining it to the attic of her home.

[0]:
[https://en.wikipedia.org/wiki/Be_Right_Back](https://en.wikipedia.org/wiki/Be_Right_Back)

~~~
thrwaway69
Arguably, if someone could get all my data online (I never use my real name
or make it easy for anyone IRL to come across it), they would have a better
version of me than what others get from me IRL right now. It would feel more
real.

Perhaps that is something I should be worried about and change, but it's
never so easy to come across stuff that requires deep conversation and shows
what one is truly like. Besides, I don't say anything controversial or
misleading offline. I fear the lack of context will lead to people filling in
a lot of black holes. You can't spit out previous links or cite multiple
sources to build up a somewhat unpopular or non-mainstream opinion that runs
contrary to popular media.

Not many people have the time or attention span anyway. Just talk about food,
daily chores and work.

------
minimaxir
From anecdotal testing, using the 774M/1.5B GPT-2 models for anything less
than _hundreds of megabytes of input data_ will result in _worse_ generation
quality than using the smaller 124M/355M models.

The addiction to the larger GPT-2 models is IMO a trap.

~~~
Tenoke
It's definitely not the case for me. I have models trained on the same
dataset, which is 14MB (though I needed to tweak more for the 1.5B).

1.5B outperforms the smaller models here if trained long enough - in this
case 1-2 months, as I was doing it all for free on Colab.

One of the big things was batching - it seems like nobody really tries to do
larger batches with the biggest models, and without batching, while having so
little data, the model was getting stuck.
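
A minimal sketch of what batching via gradient accumulation looks like
(shown in PyTorch/HuggingFace transformers purely for illustration - the runs
described here used a TensorFlow/TPU fork of the original gpt-2 code, and the
data below is a placeholder):

    # Gradient accumulation: sum gradients over many micro-batches so the
    # update behaves like one large batch, without the VRAM cost.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    accumulation_steps = 32  # effective batch = 32 micro-batches

    chat_lines = ["example chat line one", "example chat line two"] * 64  # placeholder data

    for step, text in enumerate(chat_lines):
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        # Scale the loss so the accumulated gradient matches a large batch.
        (outputs.loss / accumulation_steps).backward()

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()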

~~~
MasterScrat
You trained (finetuned) GPT2 for 1-2 months on 14mb of data?

I don't understand how this doesn't massively overfit. How long of these 1-2
months was the model actually training?

~~~
Tenoke
I trained for maybe ~12 hours a day; some days, especially around Christmas,
I didn't. I also lost a lot of days trying out different stuff, or when the
weights didn't save to Drive before the Colab timed out.

Having said that, I was training the full model with an accumulated batch
size for a while, so it was taking >10 min per step. I've also been using
pretty low learning rates for most of the latter stages.

Overall the model is currently at ~11k steps and the loss can actually go
down further, but after playing with different checkpoints last week, the
best one didn't seem to be the newest one, so I left it at that one.
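
On picking between checkpoints, a rough sketch of how that comparison can be
done - score each saved checkpoint on a held-out chunk of the chat log and
keep the one with the lowest loss, rather than assuming the newest is best.
The paths and held-out text below are hypothetical placeholders, not the
actual setup:

    # Compare several saved checkpoints by loss on held-out data.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    held_out = tokenizer("some held-out chat text ...", return_tensors="pt")

    best = None
    for path in ["ckpt-9000", "ckpt-10000", "ckpt-11000"]:  # hypothetical dirs
        model = GPT2LMHeadModel.from_pretrained(path)
        model.eval()
        with torch.no_grad():
            loss = model(**held_out, labels=held_out["input_ids"]).loss.item()
        if best is None or loss < best[1]:
            best = (path, loss)

    print("best checkpoint:", best)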

------
siavosh
These tinkering use cases of GPT-2 (including the dungeon game) are amazing
to see. As the model improves, it makes me think of essentially everyone
having access to a conversational Einstein, Lincoln, etc. - instant
friends/advisors from history.

------
Tenoke
The post includes a link to a Colab where you can achieve the same for free.

Warning though - it took me ~2 months of training (on and off) to get it
there.

~~~
sillysaurusx
I can't believe that someone actually used my TPU fork of gpt-2 to train 1.5B
for months. That was the goal when I made it, but I'm shocked someone actually
put in the legwork to do it.

Well done!

What were some of the Colab pain points you ran into? Sometimes Colab unmounts
the drive folder for me, or fails to upload any data until the runtime is
reset. But those cases have been pretty rare.

Did you have to micromanage disk space much? Google drive gives lots of space,
but it goes by pretty fast when each snapshot is 5.6GB.

(Anything I can do to make this process easier? Feature requests / fix
requests are always welcome.)

~~~
Tenoke
Thanks again for making it possible!

>What were some of the Colab pain points you ran into?

You've thankfully added fixes for some of the big ones - like how you can't
just straight delete a file because it gets sent to Drive's Trash. Emptying
them out is a nice approach.

Some of the big annoyances were: having to keep the Colab tab open on a
machine at all times; dealing with the leftover small files; Drive adding
encoding changes to files, often making it hard to pull changes even if I git
stash and reset --hard; occasional (though not that often overall) complete
stops for no reason - not even an error; mounting Drive taking you out of the
notebook to auth for no real reason; different lib versions between their GPU
and TPU runtimes. Nothing too big, really - just minor annoyances.
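
One way to deal with the Trash issue from inside a Colab cell is to call the
Drive API's emptyTrash endpoint after deleting old checkpoints - a sketch,
assuming the standard Colab auth flow and the Drive v3 API (the fork's own
cleanup may work differently):

    # Permanently empty Google Drive's trash so deleted snapshots
    # stop counting against the Drive quota.
    from google.colab import auth
    from googleapiclient.discovery import build

    auth.authenticate_user()
    drive_service = build("drive", "v3")
    drive_service.files().emptyTrash().execute()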

>Did you have to micromanage disk space much? Google drive gives lots of
space, but it goes by pretty fast when each snapshot is 5.6GB.

Yes, so I bit the bullet and just paid a few $ for Google One to save myself
the trouble after a few weeks of dealing with it.

>Anything I can do to make this process easier? Feature requests / fix
requests are always welcome

Add a better README. That would probably be the highest value change you can
make to the repo.

------
leblancfg
> Fun fact - there is a sentence in this post written entirely by the GPT
> version of Me. I wonder how easy it is to spot.

...I couldn't spot it. Anyone? Eerie...

~~~
jerf
My best guess is "Additionally it sometimes talks about things that aren’t
really True - like the back pain in Example 1, and if you play with the
different parameters (top_k/top_p and temperature mainly) you can force it to
go on long tirades which eventually become nonsensical."

True shouldn't be capitalized like that (influence from sample Python code or
another language that uses "True"?), and Example 1 doesn't discuss back pain.
I don't know enough about GPT or whatever other possible models may be getting
discussed to know whether "top_k/top_p" make sense, though temperature would
seem to.
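
For reference, those are the standard sampling knobs: temperature rescales
the token probabilities, top_k keeps only the k most likely tokens, and top_p
(nucleus sampling) keeps the smallest set of tokens covering p of the
probability mass. A minimal sketch using HuggingFace transformers (the blog
post itself used the original GPT-2 codebase, which exposes the same
parameters; the prompt below is made up):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prompt = tokenizer("Me: should I learn Rust this year?\nGPT-Me:", return_tensors="pt")
    out = model.generate(
        **prompt,
        do_sample=True,
        max_length=100,
        top_k=40,         # keep only the 40 most likely next tokens
        top_p=0.9,        # nucleus sampling: keep 90% of probability mass
        temperature=0.8,  # <1 is more conservative, >1 is more chaotic
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))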

------
fapjacks
There have been a number of posts over the last few days like this about
giving (more) of your (sensitive) data to Google. Lots of comments in the
threads about exporting and uploading messages from e.g. WhatsApp and
Telegram, and a surprising lack of concern about it.

------
sirsuki
I am surprised that the book [“The Blue Nowhere” by Jeffery Deaver](https://terebrate.blogspot.com/2012/11/book-review-blue-nowhere-by-jeffery.html?m=1)
hasn't been mentioned, as one of its plot points explores machine-human
interaction like this. Neat read, BTW.

------
supernintendo
This is so fun. A question for you (or anyone else familiar with this topic):
what hardware would you recommend for someone just getting into training
GPT-2 models? Would a Radeon RX 580 be enough?

~~~
minimaxir
You cannot train any GPT-2 models with an AMD GPU. Nvidia's CUDA is still the
de facto toolkit.

Either use Colab (free), or a preemptible GPU instance on GCE w/ the Deep
Learning VM image (relatively cheap). Using consumer GPUs is a recipe for
frustration.

~~~
Tenoke
>You cannot train any GPT-2 models with an AMD GPU.

It seems like you can. I know of at least one person who has finetuned 1.5B
on a 16GB AMD card. I think u/sillysaurusx had some part in it, but apparently
translating the code from CUDA was fairly easy.

~~~
gwern
There are also several people on Twitter who have mentioned training it on AMD
GPUs.

------
mirimir
Hey, I just talk to myself ;)

Sometimes I use different voices, for emphasis.

I actually learned that in a course. The context was having a completion
conversation with someone who had died. But it works in other contexts too.

------
kqr
This is like a personalised version of Oblique Strategies. Exciting!

------
drcode
...the moment where he jokes about "turning it on and off again" and his GPT2
doppelganger laughs...

------
mycall
> predict the next word in 40GB of Internet text

This could do wonders for lip reading correction.

~~~
britmob
OpenAI trained the initial 1.5B model on ~160G of text.. so I’m sure it’s
already going to give amazing results.

------
qnxub
Is this the best way to create a chatbot with my personality? I feel like I
would want to fine tune some things so it is giving real responses about my
preferences, hobbies, etc.

My use case is preserving my personality for loved ones after I die.

~~~
backupcavalry
Not knocking you, but I'd love to see some research on whether this would
actually be a positive for loved ones - for me, I know I'd prefer them to
move on to fresher things in life.

That and the Black Mirror episode another commenter mentioned.

------
sroussey
I want to train on my MacBook. What are the options?

~~~
Tenoke
I include the link to the Colab, which means it's trained for free on Google's
machines, and you just access it from your browser.

Of course, you might not want to have sensitive data on Google's machines for
one reason or another, in which case you'd have to buy an external GPU, or
better yet a whole other machine.

~~~
minimaxir
Training the smallest GPT-2 model uses about 11-12GB of GPU VRAM; consumer
GPUs cap out at about 8GB.

GPT-2 1.5B will _definitely_ not train on a consumer GPU.

~~~
Tenoke
You can't train the full thing, but you can freeze everything except the
transformer layers (which is what shawwwn and gwern do anyway even though they
do have the memory). You also need gradient checkpointing of course.
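
Roughly, that looks something like the sketch below - the same idea expressed
in PyTorch/HuggingFace transformers (the actual TPU fork is TensorFlow, so
take the specifics as illustrative, and gradient_checkpointing_enable assumes
a recent transformers version):

    import torch
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")  # the 1.5B checkpoint

    # Freeze the token and position embeddings; only the transformer
    # blocks keep requires_grad=True and receive updates.
    for param in model.transformer.wte.parameters():
        param.requires_grad = False
    for param in model.transformer.wpe.parameters():
        param.requires_grad = False

    # Recompute activations during backprop instead of storing them,
    # trading compute for a much smaller memory footprint.
    model.gradient_checkpointing_enable()

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)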

~~~
sroussey
Can anything be done on a mobile device yet?

~~~
Tenoke
Yes, there are a lot of models designed to work okay on mobile. Though you'd
typically train in the cloud and only use the trained model on the phone.
Alternatively, you can train across many phones, which brings a lot of extra
challenges but is definitely possible.

Google's very new Reformer[0] would likely be your best bet if you want
something truly cutting-edge but have less compute, even as little as a
mobile's. As far as I know, it hasn't been used on phones yet (again, it's
very new) but I bet it can be done.

0\. [https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html)

~~~
sroussey
Interesting! Thank you for the link.

I don’t mind training on a desktop and using it on both desktop and mobile.
We kinda already have that problem, since we parse Google data for a given
Android phone, but it doesn’t have the memory or compute for the amount of
data the phone has generated over the years. The user will background the app
too quickly. So we need to ask the desktop app to do it, process there, and
sync results back.

------
gambler
Can't wait until chat bots trained on someone's messages are used as
"evidence" of what that person thinks. It's blatantly obvious that the crowd
here would accept this as valid analysis if the whole thing is peppered with
appropriate buzzwords.

