
Show HN: Train a language model to talk like you - MasterScrat
https://colab.research.google.com/drive/1iHcQ8_K0cfRE3v8QX6FMKAzdSSGtf5IX
======
MasterScrat
You may have seen my recent post about Chatistics, a Python tool to parse
your Messenger/Hangouts/WhatsApp/Telegram chat logs into DataFrames:
[https://news.ycombinator.com/item?id=22069699](https://news.ycombinator.com/item?id=22069699).

This notebook uses the exported chat logs to train a simple GPT/GPT2
conversational model! It uses Google Colab, a notebook platform that allows
you to train complex models online for free.

The approach is super simple: it takes all your chat logs, turns them into
this format:

> <speaker1> Hi

> <speaker2> Hey - how are you?

> <speaker1> Great, thanks!

> ...

...then simply trains a GPT model on this corpus. In practice, I found that
the default parameters (including using GPT and not GPT2) give the best
results for this setup.
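
A rough sketch of that formatting step, for illustration only (this is not the notebook's actual code). It assumes the chat log is already a list of `(sender, text)` pairs and that speaker tags are handed out in order of first appearance; `to_training_corpus` is a made-up name.

```python
from typing import Dict, Iterable, List, Tuple


def to_training_corpus(messages: Iterable[Tuple[str, str]]) -> str:
    """Render (sender, text) pairs as '<speakerN> text' lines, one per message.

    Assumption: tags are assigned in order of first appearance, so the first
    sender seen becomes <speaker1>, the second <speaker2>, and so on.
    """
    speaker_tags: Dict[str, str] = {}
    lines: List[str] = []
    for sender, text in messages:
        # Collapse newlines and repeated spaces so each message stays on one line.
        text = " ".join(text.split())
        if not text:
            continue  # skip empty messages (e.g. attachments with no text)
        tag = speaker_tags.setdefault(sender, f"<speaker{len(speaker_tags) + 1}>")
        lines.append(f"{tag} {text}")
    return "\n".join(lines)


chat = [
    ("alice", "Hi"),
    ("bob", "Hey - how are you?"),
    ("alice", "Great, thanks!"),
]
corpus = to_training_corpus(chat)
```

The resulting string is what would be written to a text file and fed to the fine-tuning step.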

This notebook will be part of our workshop "Meet your Artificial Self"
happening this Saturday at AMLD 2020 in Lausanne, Switzerland:
[https://appliedmldays.org/workshops/meet-your-artificial-sel...](https://appliedmldays.org/workshops/meet-your-artificial-self-generate-text-that-sounds-like-you)

Feedback is welcome! :D

~~~
prophesi
I definitely need to give this a whirl. Does it use Python 2 or 3, and is it
as simple as importing its ipynb to run it in a local Jupyter notebook?

------
capableweb
I got a bit tricked by the title here on HN. Maybe we can replace `talk` with
`write`? I thought this was something that could learn how I speak and
generate sound from that, but it seems to be just about written language,
which is not nearly as interesting (for me).

~~~
moron4hire
Yeah, Microsoft has had NN-based speech generators that can mimic your own
voice for about a year now. Thought this was going to be a competing service.

------
arethuza
I'm disappointed that this is about typed text rather than actual talking - I
had hoped that training something that talked like me might assist technology
vendors in actually creating voice recognition technology that works for me.

And yes my problems with voice recognition are probably due to my Scottish
accent.... ;-)

------
Tenoke
I've been playing with training different sizes[0] of gpt on my own chat data
precisely for this reason.

Coincidentally, today I was even planning to publish my last post and notebook
for training gpt2-1.5b and then chatting to oneself with the model. I left it
for tomorrow though.. Maybe a mistake.

There is quite a lot you can do, and talking to my trained model, which
responds to me as me, can get really weird at times. It's definitely the most
engaged I've been with gpt while talking to myself.

Having said that you seem to train here on very little. Still - cool demo.

[0]
[https://svilentodorov.xyz/blog/gpt-345M-finetune/](https://svilentodorov.xyz/blog/gpt-345M-finetune/)

~~~
MasterScrat
I would be very curious to see your notebook - while this simple approach
works well with GPT, we are not getting the results we'd want with a more
complex question/answer model that uses GPT2. So I'd love to see your
implementation details!

> Having said that you seem to train here on very little.

The datasets provided in the notebook are really meant to be fallbacks for
people who are not willing to use their own chat log data. When training on my
own data, I have about 500k messages, which starts being enough to get
interesting results.

edit: wow, I see you're training on "14M facebook messages", that's impressive
- do you actually chat that much?!

~~~
Tenoke
I just pushed it, the blog post (which includes the notebook) is here[0].

It's 14 MB of data; I'm not sure how many messages it actually comes down to,
but FB Messenger has been my main platform for talking to friends for the last
decade.

[0]
[https://svilentodorov.xyz/blog/gpt-15b-chat-finetune/](https://svilentodorov.xyz/blog/gpt-15b-chat-finetune/)

~~~
MasterScrat
Awesome :D

We've published the third notebook:
[https://colab.research.google.com/drive/1XYNef9zcHhTjt6kM6yd...](https://colab.research.google.com/drive/1XYNef9zcHhTjt6kM6ydL9oXTshoRknIV)

We would gladly have your feedback on our approach

------
perturbation
This is cool - might be worth training a simple discriminator model to
identify _your_ utterances, and then you can use the plug-and-play language
model (PPLM -
[https://github.com/huggingface/transformers/blob/master/exam...](https://github.com/huggingface/transformers/blob/master/examples/pplm/run_pplm.py))
to generate utterances modeling a specific speaker without special tokens.
Could also take less time to fine-tune.

------
the-dude
I totally missed that Lyrebird was acquired :
[https://news.ycombinator.com/item?id=21006405](https://news.ycombinator.com/item?id=21006405)

------
data_ders
My curiosity is tempered by the fact that I've seen this episode of Black
Mirror before... :)

[https://en.wikipedia.org/wiki/Be_Right_Back](https://en.wikipedia.org/wiki/Be_Right_Back)

~~~
ferCats99
I think it's more like White Christmas

------
bryanrasmussen
A computer trained to talk like me would spend a lot of time swearing and
whining about how it can't take it anymore, which I admit would be pretty
funny.

------
raidicy
This is part of a workshop series[0]. Does anyone know if the talks/shops will
be recorded?

[0][https://appliedmldays.org/workshops](https://appliedmldays.org/workshops)

~~~
MasterScrat
We don't have any plan to record it currently.

But we will release the two other notebooks used during the workshop, and plan
to write a blog post detailing the full content.

~~~
raidicy
That sounds great! This and the main workshop site have many subjects I'm
really interested to check out. Thank you!

------
thisisastopsign
I’ve never used PyTorch before... is this running on my local machine, or
is there some API in here that’s also sending data to Google to train
their models? Asking from a privacy point of view.

~~~
heybrandons
The Python notebook is hosted on Google Colab, which executes it on free (for
you) Google servers. If you’re concerned about privacy, you probably shouldn't
upload your personal chat logs. You could also download the notebook and
install what it needs on a machine you control. There also appear to be
alternative datasets to test with: Obama and movie dialogues.

------
woefulregret
throwaway, duh.

When I was a teenager I wrote a very graphic and very disturbing work of
fiction that was archived on a popular erotica text website.. I have had
anxiety for many years now that eventually someone will glue the authorship of
that story to my identity.. If people in my real life discover my fantasies
from years back because of my writing signature, I do not want to guess where
that will leave me.. I am not looking forward to the future!!

------
fudged71
Could you train this on a Q&A/FAQ corpus and get somewhat relevant results?
(And is there any better tool for doing this?)

~~~
cyorir
Along these lines, I worked on a team project in a university course to create
an automated Q&A system making use of IBM Watson. We chose to focus on a Q&A
system for business regulation in the state of Illinois. However, just using
existing FAQs isn't sufficient. To build a corpus, we scraped several websites
belonging to the state of Illinois for any information that would be relevant
to businesses operating in Illinois. Then, we created sample question-answer
pairs, with answers taken directly from the corpus. Using both the provided QA
pairs and the rest of the unlabeled corpus, Watson trained a model to answer
questions that hadn't been trained on by providing excerpts from the corpus.
By ensuring that the model was providing excerpts from the corpus, we wouldn't
have to worry that we were providing (too much) incorrect information; most of
the time, the answers were relevant, too. Of course, you could create a
similar system without using proprietary IBM software.
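
For the curious, here is a toy sketch of the "answer with an excerpt from the corpus" idea. It is not what Watson does internally; it just scores each passage by word overlap with the question and returns the best passage verbatim. The function names and sample passages are made up for illustration.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_excerpt(question: str, passages: List[str]) -> str:
    """Return the passage sharing the most words with the question.

    The answer is always copied verbatim from the corpus, never generated,
    which is the property described above. Ties go to the earliest passage.
    """
    question_words = Counter(tokenize(question))

    def overlap(passage: str) -> int:
        passage_words = Counter(tokenize(passage))
        return sum(min(count, passage_words[word])
                   for word, count in question_words.items())

    return max(passages, key=overlap)


# Made-up placeholder passages, not real regulatory text.
passages = [
    "Businesses must register before they start operating.",
    "Sales tax returns are filed monthly.",
]
answer = best_excerpt("How often are sales tax returns filed?", passages)
```

A real system would want something stronger than raw word overlap (TF-IDF or embeddings, plus a threshold for "no good answer"), but the excerpt-only constraint stays the same.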

------
MadWombat
Oh, oobee doo

I wanna be like you

I wanna walk like you, talk like you, too

You'll see it's true someone like me

Can learn to be like someone like you

------
alfonsodev
This is going to be useful for when we fully turn into cyborgs.

~~~
pjmorris
Maybe we're already there. Example: I've got a friend who worked in tech
support long enough that he built a soundboard of recordings of his voice
asking typical tech support questions in response to user problems.

~~~
whatshisface
They did that on the show "The IT Crowd."

------
nickster
I wonder if they are using this in Android Messenger or Gmail for the
suggested responses.

~~~
neodymiumphish
I really don't send many emails through Gmail, but when I do it is INSANELY
accurate in its suggested sentence completion. Sometimes simple stuff like an
address or whatever, but it can get really creepy when I'm sending something
to my wife as a reference for some bill or interaction with our landlord and
it knows exactly what I'm trying to say after just a word or two (sometimes,
something like "Hey, I just..." and it has the rest of the sentence ready to
go).

------
heybrandons
Thanks for sharing MasterScrat! This looks fun!

------
brainzap
Train it on Fred Rogers

