
Generating MIDI Music with GPT-2 - gwern
https://www.gwern.net/GPT-2-music#generating-midi-with-10k30k-context-windows
======
mebr
The author mentions that only about 5% of the generated music is worth a
listen. Even the best-sounding samples shared don't sound that good to me.

~~~
narag
Most of it sounds terrible. I guess that's because there is no way to give
the model feedback. A chess program knows what winning means. And IIRC "This
Person Does Not Exist" uses human opinion in training.

Music has many rules, not only theoretical ones but also unwritten rules about
what works. You must incorporate them into the program somehow, either in code
or by giving the program something from which to deduce them.

~~~
gwern
I previously tried an approach that uses deep RL for feedback, but I couldn't
quite get it to work: [https://www.gwern.net/GPT-2-preference-learning](https://www.gwern.net/GPT-2-preference-learning)

At the moment, it would probably be more practical to train a model to predict
ratings, use it to screen generated samples or possible completions, and throw
out the ones that score too low (the 'ranker' approach worked out very well
for the Meena chatbot recently:
[https://arxiv.org/abs/2001.09977](https://arxiv.org/abs/2001.09977) )
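
A minimal sketch of that screening loop, assuming you already have a trained
rating predictor (both helper names here are hypothetical, not from any
existing codebase):

    # Hypothetical helpers: generate_sample() wraps GPT-2 sampling of one
    # ABC tune; rating_model() is a separately-trained predictor returning
    # a human-preference score in [0, 1].
    def screen_samples(generate_sample, rating_model, n=100, threshold=0.7):
        """Generate n samples, keep only those the ranker scores highly."""
        kept = []
        for _ in range(n):
            sample = generate_sample()
            score = rating_model(sample)
            if score >= threshold:      # throw out too-low-scoring samples
                kept.append((score, sample))
        # best-first, like Meena's ranker choosing among candidate replies
        return [s for _, s in sorted(kept, key=lambda t: t[0], reverse=True)]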

~~~
p1esk
What do you think about training a binary classifier to distinguish between
human and generated samples? E.g. take a state-of-the-art model designed to
classify composers or styles, and finetune it for this task.

~~~
gwern
I think that could potentially work as a rejection sampler, but you also run
the risk that it will simply find some small discriminative detail and be
unusefully good at the classification; that's why you'd do it in a loop as a
GAN, but as you mention, GANs still work really badly on sequences, so...

If you wanted to improve my ABC-MIDI GPT-2, the most straightforward ways
would be to do data cleaning (I'm sure there are tens of thousands of awful
MIDI files which should be removed! Data cleaning always makes a large
difference with RNNs, GPT-2, or GANs) and to increase the model size (the fact
that the loss bottomed out at 0.20, which is still quite bad, suggests that
MIDI is hard enough that GPT-2 is struggling). More interesting would be to
use Reformer or another long-range Transformer and try to operate directly on
a more raw representation, like the piano-roll representation of MIDI. I think
GPT-2 makes a lot of syntax errors which cripple outputs when a 'voice' goes
silent, and a piano-roll representation would be a lot more robust (at the
cost of being something like 10x larger).
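
For the data-cleaning step, one plausible filter is to drop MIDI files that
fail to parse or are trivially short; a sketch using the mido library (the
length threshold is an arbitrary assumption):

    import os
    import mido

    def clean_midi_corpus(src_dir, min_seconds=10.0):
        """Keep only MIDI files that parse cleanly and exceed a minimum length."""
        kept = []
        for name in os.listdir(src_dir):
            if not name.lower().endswith((".mid", ".midi")):
                continue
            path = os.path.join(src_dir, name)
            try:
                mid = mido.MidiFile(path)       # raises on corrupt/truncated files
                if mid.length >= min_seconds:   # drop trivially short files
                    kept.append(path)
            except Exception:
                pass                            # unparseable: treat as awful, skip
        return kept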

~~~
p1esk
How would you present a piano roll to a Transformer (e.g. what would one
element of the sequence be)? You could try using a tuple of pitch integers for
each timestep. I'm not sure how big a "vocabulary" would need to be to
capture most of the chords (note combinations) - it might actually be
comparable in size to a language vocabulary (tens of thousands of words). You
could use two channels to capture note onset/offset info (as was done in the
biaxial RNN paper). Or the encoding used for MuseNet (with explicit timing
info), but somehow I like the idea of "chords as words" better.
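
A sketch of the "chords as words" idea - map each distinct combination of
simultaneously-active pitches to one token - which would also let you measure
the actual vocabulary size empirically (the piano-roll input format here is an
assumption):

    from collections import Counter

    def build_chord_vocab(piano_rolls, max_vocab=30000):
        """Assign token ids to the most common pitch combinations.

        piano_rolls: iterable of 2-D arrays shaped (timesteps, 128),
        non-zero where a note sounds at that timestep."""
        counts = Counter()
        for roll in piano_rolls:
            for step in roll:
                chord = tuple(i for i, v in enumerate(step) if v)
                counts[chord] += 1
        # len(counts) tells you whether chords really number in the
        # tens of thousands, like a language vocabulary
        return {chord: idx for idx, (chord, _) in
                enumerate(counts.most_common(max_vocab))}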

~~~
gwern
You would simply have a vector of 128 entries for each timestep (I'm not sure
how fine-grained MIDI timing gets), and any of the 128 notes turned on would
be non-zero. It's not a dense encoding, by a long shot, but it's quite literal
and explicit, so you don't have to worry about indirectness making it hard to
learn (assuming you have a model which can handle such long-range inputs to
begin with).
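
Concretely, that encoding falls out of most MIDI parsers directly; a sketch
with the pretty_midi library (the sampling rate fs is an arbitrary choice):

    import numpy as np
    import pretty_midi

    def midi_to_timesteps(path, fs=16):
        """One 128-entry vector per timestep, non-zero where a note sounds."""
        midi = pretty_midi.PrettyMIDI(path)
        roll = midi.get_piano_roll(fs=fs)    # (128 pitches, timesteps), velocities
        return (roll.T > 0).astype(np.int8)  # (timesteps, 128), binarized

At fs=16, a three-minute tune is already ~2,900 timesteps of 128 entries each,
which is exactly why a long-range model is a prerequisite.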

Personally, I'd rather try redoing the BPE encoding specifically for ABC-MIDI.
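
That retraining is cheap to prototype; a sketch with the Hugging Face
tokenizers library (the corpus filename and vocabulary size are placeholders):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer

    # Train a fresh byte-pair encoding on raw ABC notation instead of
    # reusing GPT-2's English-text BPE, so the merges reflect ABC idioms
    # (note+duration pairs, bar lines) rather than English words.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
    tokenizer.train(files=["abc_corpus.txt"], trainer=trainer)
    tokenizer.save("abc-bpe.json")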

------
slfnflctd
Many comments here seem somewhat dismissive, but I heard a number of terrific
melodies & counter-melodies and would certainly consider working many of them
into a song. Also, a lot of the notes or chords which sound 'jarring' to the
average listener might work great in a jazz or contemporary-classical context,
especially with added layering.

It's definitely interesting to me. Not sure I see a product in there... but if
it could be made more efficient and set up to generate smaller, tighter clips
with better instrumentation (after being trained a bit more on what people
like), and had a few key features like time stretching, it could prove useful
to creative types. Writing a melody or progression doesn't always come easily,
and a little push can do wonders for writer's block.

~~~
snlnspc
This is my take as well. There are a lot of lines that would be right at home
in anything moderately jazzy or progressive, and plenty more that I would have
a great time using as a starting point.

------
econcon
Is there technology available that can change your voice in real time on phone
calls? I assume that for it to work, latency has to be low.

There is this project [https://github.com/CorentinJ/Real-Time-Voice-
Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)

But I found it pretty hard to run, and it doesn't have voice input.

~~~
narrationbox
(Disclaimer: we work on voice synthesis/style transfer and cloning)

Depends on how much fidelity you want and how much lag you are willing to
accept. Our current state-of-the-art voice style transfer is sufficiently
capable, though the results may still need anywhere from six months to two
years of development to be considered "production ready" - i.e. think poor
quality, noise, and artefacts in the output audio with existing tech. Pasini
has a pretty good blog post and paper on this:

[https://towardsdatascience.com/voice-translation-and-
audio-s...](https://towardsdatascience.com/voice-translation-and-audio-style-
transfer-with-gans-b63d58f61854)

------
jefftk
One thing I was hoping to see from GPT-2 is phrasing. Most of the folk music
they're using as input has a structure of internal repetition that lets you
know where you are in the tune. Most commonly: AABB structure, 64 beats total,
with every 2^n beats clearly grouped. Some of the examples seemed to have this
at the 8-beat level, but I couldn't find any with it at the 16-, 32-, or
64-beat level.

Here's a random snippet of a real tune where you can clearly hear the AABB
structure:
[https://www.jefftk.com/contras/tunes/cast64__starabovethegar...](https://www.jefftk.com/contras/tunes/cast64__starabovethegarter.mp3)
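
One crude way to check generated tunes for that structure mechanically is to
compare adjacent fixed-length slices of the note sequence at each 2^n
grouping (a sketch; exact matching understates real repeats-with-variation,
so treat the numbers as a lower bound):

    def repetition_at_scale(notes, beats_per_unit, notes_per_beat=2):
        """Fraction of adjacent equal-length slices that repeat exactly.

        notes: a flat list of pitches from one tune."""
        size = beats_per_unit * notes_per_beat
        slices = [tuple(notes[i:i + size])
                  for i in range(0, len(notes) - size + 1, size)]
        if len(slices) < 2:
            return 0.0
        repeats = sum(a == b for a, b in zip(slices, slices[1:]))
        return repeats / (len(slices) - 1)

    # e.g.: for scale in (8, 16, 32): print(scale, repetition_at_scale(tune, scale))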

~~~
p1esk
You don't really need any AI to do phrase-repetition patterns. Typically you
simply repeat the phrase as-is, or maybe shift the pitch. This can be done
algorithmically after the phrase-generation stage.
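
For instance, a sketch of assembling an AABB tune from two generated phrases,
with an optional transposition on the repeat (pitches are MIDI note numbers;
the interval is arbitrary):

    def transpose(phrase, semitones):
        """Shift every MIDI pitch in a phrase by a fixed interval."""
        return [p + semitones for p in phrase]

    def aabb(phrase_a, phrase_b, shift=0):
        """Classic AABB form: each phrase stated twice; shift=0 repeats as-is."""
        return (phrase_a + transpose(phrase_a, shift)
                + phrase_b + transpose(phrase_b, shift))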

It would be nice if the network learned these patterns (and transitions) on
its own, but it's far more important to generate quality phrases. That is the
hard part, and as you can see from the samples, we are not there yet.

~~~
jefftk
GPT-2 for text is impressive because it does a surprisingly good job with
larger-scale patterns. If you give it a play, it generates things with actor
marks, its sentences are a reasonable length, etc. I had been hoping this
would carry over to learning the structure of tunes, but it seems that
requires something we haven't figured out yet.

I do disagree, though, that phrase repetition is as simple as you say. A good
16-count phrase has patterns within it, and a good AABB tune has relationships
between the A and B parts.

------
p1esk
For the state of the art in computer-generated music, check out
[https://aiva.ai](https://aiva.ai)

------
carrolldunham
the state of this technology after being called 'too dangerous to release'

~~~
ehsankia
As mentioned in the other comment, this is a very disingenuous comment. GPT-2
was made for generating text, and the specific thing it's dangerous for is
disinformation bots on Twitter/Reddit as well as fake generated articles, two
things the original model excels at. It's like claiming that nuclear bombs
aren't dangerous because they're not good at taking us to space.

~~~
carrolldunham
What I meant by "state" is status. As in, this kind of humble, retro-styled
application is where we find GPT-2 in 2020, instead of it dominating the news
cycle with what you claim it is so good at. If it excels at text, where are
the fake articles and tweets? It's just not that good.

~~~
djannzjkzxn
Have you been reading the mass comments/tweets on political stories lately?
Some of it does seem bot-like to me. I don’t have a strong opinion about how
much this is happening. But I know I could personally run a bot that generates
political spam with GPT-2. I know a lot of people want to influence the
political conversation. I mean, people get paid salaries to write comments
online. So I can’t help but suspect that someone is using tools that allow it
to be done a lot more cheaply.

~~~
seanwilson
Saying that it could be in wide use but that you wouldn't be able to detect it
is an impossible-to-falsify claim, though.

~~~
djannzjkzxn
The way to falsify it is to track down the identity of the human author of
every internet comment. Simple!

To get philosophical I won’t say that I “know” it’s happening. Most of the
world consists of things I don’t know about. The best I can do is build a
mental model based on what I do know.

------
8bitsrule
Probably should have left the title alone. There's no such thing as 'MIDI
music'. MIDI is a communications protocol to transfer encoded performance data
between electronic sources and sinks.

Algorithmic performance data is a thing, but generally not very musical.
There've been a couple of exceptions.

~~~
h-cobordism
> There's no such thing as 'MIDI music'. MIDI is a communications protocol to
> transfer encoded performance data between electronic sources and sinks.

Why can't I just as easily say that

> There's no such thing as 'sheet music'. Sheets are a communications protocol
> to transfer encoded performance data between human sources and sinks.

