
Finetuning GPT-2 to Generate Beatles Lyrics - eugenhotaj
https://towardsdatascience.com/generating-beatles-lyrics-with-machine-learning-1355635d5c4e
======
gwern
His data formatting could be improved here. Title + authors would be better
off denoted somehow, like using quotes, and the separate songs should be
explicitly delimited using '<|endoftext|>'. Looking at the samples in
[https://github.com/EugenHotaj/beatles/blob/master/gpt_2_generated.txt](https://github.com/EugenHotaj/beatles/blob/master/gpt_2_generated.txt),
GPT-2 does manage to mostly figure out that the songs are separate, but
omitting '<|endoftext|>' makes it harder on GPT-2, more prone to run-ons
(already a problem with GPT-2), and also makes prompting less effective (since
you can't prompt it like '<|endoftext|>"On The Run" by John Lennon\n' to make
it generate lyrics for a specific title & author). It also wouldn't be bad if he
had included the specific commands + hyperparameters for the nshepperd repo
he's apparently using, even if only the defaults along the lines of the
examples in my own writeup
([https://www.gwern.net/GPT-2](https://www.gwern.net/GPT-2)).
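
A minimal sketch of what that training-file formatting could look like (the
`songs` list and field names here are made up, not his actual data pipeline):

```python
# Sketch: one metadata header line per song, the lyrics, then GPT-2's
# end-of-document token so every song is an explicit, separate document.
songs = [
    {"title": "On The Run", "author": "John Lennon", "lyrics": "..."},
    # ...
]

with open("beatles_train.txt", "w") as f:
    for song in songs:
        f.write('"{}" by {}\n'.format(song["title"], song["author"]))
        f.write(song["lyrics"].strip() + "\n")
        f.write("<|endoftext|>\n")
```

At sampling time you can then reuse the same header format as the prompt to
steer generation toward a particular title & author.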

I'm not surprised that GPT-2-117M has memorized songs by the end of training;
it's not a very large corpus of songs, so it's hard to learn and generalize
well from it. If one were working more on this, it'd probably make sense to
train on a much larger and more varied corpus of songs (with inline metadata
properly formatted to allow controllable generation); something like RapGenius, maybe?

~~~
eugenhotaj
Hi, author here.

Yea, I did the delimiting you mentioned when "training" a bigram model. For
GPT-2 I was mostly interested in how well the model would be able to pick up
signals from the raw data, so I didn't do any kind of preprocessing at all
(it's also not very fun ;)). I think it's interesting that the model was able
to pick up titles, authors, and starts/ends of songs on its own.

I didn't try generating specific songs, but that's a good idea. Having the
delimiters would probably improve things, but feeding in "On the Run\nJohn
Lennon" would work as well with the current approach.

Using a RapGenius corpus is also something interesting that I didn't think
about. The goal of the post was to generate Beatles lyrics, not song lyrics in
general. To that end, I'd like to see what you get if you first fine-tune on
RapGenius to learn general things like song structure, rhyme, etc., then fine-
tune even further on the Beatles corpus. I suspect you'd get much nicer, less
memorized songs.

~~~
drusepth
>To that end, I'd like to see what you get if you first fine-tune on RapGenius
to learn general things like song structure, rhyme, etc., then fine-tune even
further on the Beatles corpus. I suspect you'd get much nicer, less memorized
songs.

OT: Is that how fine-tuning actually works with GPT-2? It makes sense that
it'd just be strengthening connections on the most-recently-fine-tuned corpus,
with previous fine-tunes still around in some way.

Should you expect that first fine-tune to pick up and solidify song structure,
rhyme, etc., and the second fine-tune to keep those concepts in place while
muddying up other aspects like the specific lyrics used?

(Hope this doesn't come off as "you're wrong" or too off topic -- I'm just
very interested and would love to read more about how all this works. :) )

~~~
eugenhotaj
I would expect it to (but I haven't thought about it too deeply so I could be
extremely wrong). My thinking is as follows:

At the end of the day, all we're doing is maximum likelihood estimation. So
we're trying to find model parameters which define a probability distribution
where our observed data is the most probable. In the original GPT-2, this
observed data is the text from quality outgoing links on Reddit. Since this
data is so diverse, there will not really be any special structure that the
model can pick up on, besides whatever structure exists in the English
language.
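
Concretely, the objective is the same next-token negative log-likelihood in
both pre-training and fine-tuning; a toy sketch of what's being minimized (the
`model` here is just a stand-in that returns next-token logits, not GPT-2
itself):

```python
import torch.nn.functional as F

def nll_loss(model, tokens):
    # tokens: (batch, seq_len) integer token ids.
    # Maximizing the likelihood of the observed text = minimizing the average
    # negative log-probability of each next token given its prefix.
    logits = model(tokens[:, :-1])   # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]          # same sequence shifted by one position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```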

However, when we fine-tune on RapGenius, the observed data is now songs. These
songs have a certain structure to them such as stanzas, rhyming, etc. In order
to maximize the likelihood of this data, the model must learn to model the
structure.

Finally, if we further fine-tune on Beatles lyrics, the model is again trying
to find parameters which maximize the likelihood of the data. So the model
will try to match both the lyrics and the structure of Beatles songs. It's
likely that the structure of Beatles songs is pretty similar to the other
songs from RapGenius, so mostly what will change are the lyrics. Also,
changing the lyrics seems to be the most straightforward way to maximize the
likelihood since by definition we want these particular lyrics to be the most
likely.
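
Sketched as pseudocode (the `load_pretrained_gpt2` / `finetune` functions are
placeholders, not the actual nshepperd scripts), the two-stage idea is just
the same training loop run twice on different data:

```python
# Hypothetical outline of the two-stage idea.
model = load_pretrained_gpt2("117M")  # placeholder loader

# Stage 1: learn general song structure (stanzas, rhyme, titles) from a big corpus.
finetune(model, corpus="rapgenius.txt", steps=20000, lr=1e-4)

# Stage 2: adapt to Beatles lyrics; fewer steps / lower LR so the lyrics shift
# without washing out the structure learned in stage 1.
finetune(model, corpus="beatles.txt", steps=2000, lr=2e-5)
```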

That being said, this is all just conjecture. It would be interesting to try
out both methods and see if you get better results doing this two-step fine-
tuning vs. the original fine-tuning (or just fine-tuning on RapGenius then
conditionally sampling Beatles songs as @gwern suggested).

------
lostmsu
Or any lyrics:
[http://billion.dev.losttech.software:2095/](http://billion.dev.losttech.software:2095/)

And the blog article:
[https://habr.com/post/453232/](https://habr.com/post/453232/) (also there's
no paywall here)

~~~
eugenhotaj
Really cool stuff, thanks for sharing!

------
kastnerkyle
Tricks in beam search to force rhyme schemes, or techniques like constrained
Markov chains (cf.
[https://redylan.neocities.org/#/how-it-works/](https://redylan.neocities.org/#/how-it-works/) and
[https://github.com/gabrielebarbieri/markovchain](https://github.com/gabrielebarbieri/markovchain)),
can give really strong results in lyric / structured text generation.

Might be worth investigating if you are interested in this application.
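
Roughly, the beam-filtering trick might look like this sketch (the
`model.top_tokens`, `ends_line`, `last_word`, and `rhymes_with` helpers are
all stand-ins, not a particular library's API):

```python
# Sketch: expand each beam, drop candidates that violate the rhyme constraint
# at line ends, then rank the survivors by total log-probability as usual.
def constrained_beam_step(beams, model, rhyme_target, beam_width):
    candidates = []
    for prefix, logprob in beams:
        for token, token_logprob in model.top_tokens(prefix, k=beam_width):
            candidates.append((prefix + [token], logprob + token_logprob))
    survivors = [
        (seq, lp) for seq, lp in candidates
        if not ends_line(seq) or rhymes_with(last_word(seq), rhyme_target)
    ]
    survivors.sort(key=lambda c: c[1], reverse=True)
    return survivors[:beam_width]
```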

~~~
gwern
Is beam search a good idea? Whenever anyone tries beam search on a neural
language model like a char-RNN or GPT-2, it seems to generally either do
little or make it much worse (by exacerbating the repetition problem), and get
worse the more beams/computation you do: eg [https://github.com/karpathy/char-
rnn/issues/138](https://github.com/karpathy/char-rnn/issues/138) or
[https://arxiv.org/abs/1904.09751](https://arxiv.org/abs/1904.09751)

~~~
yorwba
If I'm interpreting "Tricks in beam search to force rhyme schemes" correctly,
the idea is to filter the beams and only keep those which correspond to the
chosen scheme. You don't _have_ to use beam search to be able to do that; you
could also roll back the generation process whenever it doesn't rhyme and try
again with a different alternative.
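
That rollback version might look something like this sketch (`sample_line`,
`last_word`, and `rhymes_with` are hypothetical helpers):

```python
# Sketch: sample one line at a time; if it breaks the rhyme scheme, throw it
# away and resample from the same prefix instead of keeping beams around.
def sample_line_with_rollback(model, prefix, rhyme_target, max_tries=20):
    for _ in range(max_tries):
        line = sample_line(model, prefix)
        if rhymes_with(last_word(line), rhyme_target):
            return line
    return line  # give up after max_tries and keep the last attempt
```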

~~~
kastnerkyle
Yes - the crux is just to add some logic and throw out beams which don't match
your constraint, _then_ rank candidates based on sequence probability.

You can roll-back the generation process and/or mask the probability
distribution using simple secondary logic, but I find beam search gives
generally better results, especially when the word I want to _force_ is very
low probability - most of my sequence models kind of go off the rails when
they are forced into a low-probability sequence ("the man went to the
xylophone zebra sdawoqhdjwna"). Also, I find this problem gets worse in domains
without "reset" tokens like spaces, where there are always high-entropy
possibilities (the letter after a space has a lot of good choices) followed by
lower ones (after the first letter, there are often fewer good choices - at
least until you hit another space). Particularly in music generation, models
that sample a "surprising" sequence tend to go off the rails. It is also a
behavior that seems worse in RNNs than transformers, for me.
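
For comparison, the probability-masking variant mentioned above could be as
simple as this sketch (assuming you can compute the set of allowed next
tokens with your own secondary logic):

```python
import torch

def sample_masked(logits, allowed_token_ids):
    # Send everything outside the allowed set to -inf, renormalize, sample.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```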

------
HeWhoLurksLate
Tomorrow, anybody?

