Neural Style Transfer for Musical Melodies (ashispati.github.io)
99 points by lerch on Jan 11, 2018 | 40 comments

The thing is, there is no narrative or storytelling element to image style transfer. Think of Beethoven's 9th Symphony. It develops a narrative; it's not entirely abstract. The theme of the Song of Brotherhood gets subtly developed through the course of the composition. This is storytelling. This is the same reason why it's much easier to use ML to generate credible poetry. Generating a real novel with ML faces the same challenges as generating music: even instrumental music has a narrative and a story it's telling. Storytelling is a far more difficult problem than style transfer.

In one word: locality. Images have it; narratives do not.

In another word: language. Music has some elements of language that images don't have.

Interestingly, though, the failures of image generation are often of the missing-narrative kind, e.g. windowless bedrooms, or eyes and ears in the wrong places on animals. It feels like deep learning is good at local coherence but fails at global coherence.

>Why do we need Style Transfer for Music?

We don't. Most of the applications of ANNs to music that I've seen so far are solving problems that either don't exist or have been very successfully solved by much simpler methods with much better results.

If you want to apply AI to something useful in music, how about:

- Machine learning for reverb generation. I.e. being able to "record" reverb from real space and apply it to sound.

- AI/ML for high-quality sample transposition.

- AI for pure sound synthesis. Imagine something that does PCA on a set of sounds and then allows you to generate new sounds by changing components with a bunch of knobs.

- AI for real-time sample morphing. For example, being able to synthesize a choir that sings the words you need. Or, how about transforming your own voice into quality choir vocals? Hey, style transfer!

- AI for high-quality music transcription (sound to notes or MIDI). This is being done to some extent, but I don't think it's very good yet.

And so on.

I can assure you, every single one of these would be something musicians would be willing to pay good money for.

- De-reverberation is being studied intensely (in some niches at least) with these tools, and adding the reverb back (if you can learn and remove it from mixed records) means you have probably modeled it well enough to do whatever you want to. But de-reverberation is very hard, especially with single-source (1 mic) recordings; maybe looking at better reverb generation is easier with the right data. It would be nice to do something that didn't require the physical setup of impulse response recording.

- I haven't heard the term transposition for samples - is this the name for moving samples in pitch and compensating in time so that it sounds undistorted?

- On pure sound synthesis, see NSynth [0]. It's a beginning step toward this with abstract sounds.

- Sample morphing and changing the words with voices is pretty tough (it overlaps heavily with neural TTS, which has only recently started to work well), but if you can synthesize a single voice decently the rest seems doable with manual intervention. Check Neural Parametric Singing Synthesizer [1], or try the online demo [2].

- Every few months there is a new convnet which improves the baselines for transcription, so hopefully it is only a matter of time. The current best is quite a bit better than what we had only a few years ago.

There is still a huge gap between "what is possible in research" and "what is easily usable in a practical editing/creation workflow" on a typical DAW, but hopefully that gap will diminish over time.

There are also a lot of tools from statistical signal processing that were doing these tasks to some degree. Digging them back up and "neuralizing" the probability parts with neural nets is a promising way to get quality improvements, and would mirror what has been successful in neural TTS so far.

[0] https://experiments.withgoogle.com/ai/sound-maker

[1] http://www.dtic.upf.edu/~mblaauw/NPSS/

[2] https://www.voiceful.io/demos.html

>- I haven't heard the term transposition for samples - is this the name for moving samples in pitch and compensating in time so that it sounds undistorted?

Yes, that's exactly it. There are traditional algorithms that do it, but most are variations on time stretching and granular synthesis. They aren't as good as they could be. A good algorithm would be able to take in a handful of examples and generate a model that seamlessly interpolates across different pitches and velocities, accounting for differences in timbre, attack, decay, noises and so on.
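To make the "compensating in time" part concrete, here is a minimal sketch of the naive approach (plain resampling with linear interpolation), which shifts pitch but also changes duration. The function name `resample` and the toy input are made up for illustration; this is not one of the traditional algorithms mentioned above, just the baseline they improve on.

```python
def resample(samples, semitones):
    """Naive pitch shift: read through the sample faster or slower.

    Raising the pitch by `semitones` multiplies the read speed by
    2 ** (semitones / 12), so the result is shorter by the same factor.
    That length change is what a time-stretching stage has to undo.
    """
    ratio = 2 ** (semitones / 12)
    out_len = int((len(samples) - 1) / ratio) + 1
    out = []
    for i in range(out_len):
        pos = i * ratio                      # fractional read position
        left = int(pos)
        frac = pos - left
        right = min(left + 1, len(samples) - 1)
        # Linear interpolation between the two neighbouring samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


ramp = list(range(9))            # toy "sample": 9 points
octave_up = resample(ramp, 12)   # half as long
octave_down = resample(ramp, -12)  # twice as long
```

Played back at the original rate, `octave_up` sounds an octave higher but lasts half as long, and the timbre shifts along with the pitch, which is the distortion the question is about.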

Thank you for the links.

Recording reverb from real space is as simple as recording an impulse (firing a gun or anything that makes a short high intensity noise) in a room and then convolving that with the audio you want to add reverb to.

The only limit to this is how close to a perfect impulse you can generate.
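For illustration, the convolution step can be sketched in a few lines of plain Python. The names `convolve`, `dry` and `impulse_response` are made up for this sketch, the 4-tap "room" is invented toy data, and a real implementation would almost certainly use an FFT-based convolution for speed.

```python
def convolve(dry, impulse_response):
    """Direct-form linear convolution.

    Every input sample triggers a scaled copy of the room's recorded
    response, and the copies are summed.
    """
    wet = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, sample in enumerate(dry):
        for j, tap in enumerate(impulse_response):
            wet[i + j] += sample * tap
    return wet


# Toy data: a click, silence, then a quieter click, through a made-up room.
dry = [1.0, 0.0, 0.0, 0.5]
impulse_response = [1.0, 0.6, 0.3, 0.1]
wet = convolve(dry, impulse_response)
```

Note that feeding a single unit click through `convolve` returns the impulse response itself, which is exactly why recording a gunshot in the room captures the reverb.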

Well, yes, that's how convolution reverbs work. Except I've never seen anything that would allow users to record a reverb with conventional hardware and without any elaborate setups.

Also, there is more to reverb than just impulse response. A more sophisticated reverb simulator would map out the space and allow you to choose where the "listener" and the source are located in that environment.

The nearby comment about moving obstructions is on-point as well.

There is ample space for application of machine learning here.

Reverb of a static room at a point is as simple as that. Rooms with moving obstructions, with moving noise generators, at moving points, that's harder.

You really might not even need to make a great impulse... you can probably get away with recording response to a series of "known signals" (e.g. chirp signals played back over a known loudspeaker), and then determining the impulse response from that, and applying that (via convolution) to your actual signal of interest.
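A rough sketch of that idea, under stated assumptions: the "recording" is simulated by convolving a chirp with a made-up room response, the loudspeaker and microphone are treated as ideal and noise-free, and the deconvolution is a regularised frequency-domain division. All function names and the `eps` parameter are hypothetical, and the naive O(n^2) DFT is only there to keep the example dependency-free.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; fine for a toy example."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(x):
            acc += v * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def linear_chirp(n):
    """Cosine sweep whose instantaneous frequency rises from 0 to Nyquist."""
    return [math.cos(math.pi * t * t / (2 * n)) for t in range(n)]


def convolve(x, h):
    """Linear convolution, used here only to simulate the room."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            y[i + j] += a * b
    return y


def estimate_impulse_response(played, recorded, ir_length, eps=1e-9):
    """Deconvolve: H = Y * conj(X) / (|X|^2 + eps), then back to time domain."""
    size = len(recorded)
    padded = played + [0.0] * (size - len(played))
    X = dft(padded)
    Y = dft(recorded)
    H = [y * x.conjugate() / (abs(x) ** 2 + eps) for x, y in zip(X, Y)]
    h = dft(H, inverse=True)
    return [v.real for v in h[:ir_length]]


true_ir = [1.0, 0.6, 0.3, 0.1]          # invented room response
chirp = linear_chirp(64)                # the "known signal"
recorded = convolve(chirp, true_ir)     # what the microphone would capture
estimate = estimate_impulse_response(chirp, recorded, len(true_ir))
```

In this idealised setting `estimate` comes out very close to `true_ir`. With a real loudspeaker, noise and a moving room, the regularisation term and the choice of test signal matter much more.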

We're quite lucky that neural nets are overkill for procedural music generation.

By that I mean, we have huge masses of music theory, applied across genres and focusing on differences between genres, in terms of heuristics for both analysis & composition, and because of a tradition of procedural generation of music that goes back a couple centuries, a lot of it is fairly easy to translate into computer programs. (For instance, end-to-end serialist composition is easier for computers than it is for people, while canons and other mechanisms for creating permutations of melodies are equally straightforward.)
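As a small illustration of how directly these procedures translate into code, here is a sketch of the classic tone-row transformations and a simple canon. Pitches are pitch classes 0-11 for the row and MIDI note numbers for the canon; the row itself is arbitrary, and all function names are invented for this example.

```python
def transpose(row, interval):
    """Shift every pitch class by `interval` semitones (mod 12)."""
    return [(p + interval) % 12 for p in row]


def invert(row):
    """Mirror every interval around the first pitch class."""
    first = row[0]
    return [(2 * first - p) % 12 for p in row]


def retrograde(row):
    """Play the row backwards."""
    return row[::-1]


def canon(melody, delay, voices=2):
    """Each voice is the same melody entering `delay` steps later.

    None marks a rest, so all voices have equal length.
    """
    length = len(melody) + delay * (voices - 1)
    parts = []
    for v in range(voices):
        lead = [None] * (delay * v)
        tail = [None] * (length - len(lead) - len(melody))
        parts.append(lead + melody + tail)
    return parts


row = [0, 11, 3, 4, 8, 7, 9, 5, 6, 1, 2, 10]   # an arbitrary 12-tone row
retrograde_inversion = retrograde(invert(row))
two_voice_canon = canon([60, 62, 64], delay=1)
```

Every transformation of a 12-tone row is again a 12-tone row, which is what makes this kind of material so mechanical to generate.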

This doesn't translate into a straightforward method for putting in two wav files and producing a third with transferred style, but it does mean that a sufficiently motivated person can write something that translates notes between two known genres with greater ease than they would with images.

(Text is somewhere in the middle. I've worked on a couple attempts at 'style transfer' for text -- mostly using word2vec.)

I think you’re overestimating the power of music theory to serve as a basis for generation or modification of music, and also the scope of music that it explains. Rhythm has been central over the past 100 years since Western pop music has a lot of its roots in American blues. However there is surprisingly little that music theory has to say about rhythm or groove.

There are a lot of regularities and patterns in harmony and harmonic sequences which music theory covers, but there is also a combinatorial explosion of melodies that will be justified by music theory in a particular harmonic context. The choice of which melodic path to go down is very poorly constrained by music theory.

Agreed, although I generally think in terms of "obeying traditional music theory (to some extent) is necessary but not sufficient for a listenable melody". This also changes depending on genre and era - one reason for targeting early Western music, such as early two- or three-voice counterpoint, is that composition was more regular and in accordance with theory (which was kind of codified after the fact, focused on explaining these types of composition), though the "reward problem" remains. Coming around to free jazz or extremely "modern" composition means violating most or all rules while still being "musical" (at least to fans of those genres). That is going to be even tougher, and we are pretty far from automated generation of long multi-part compositions without some extra hints from theory built into the models and data, even for music that closely follows theory.

This is one reason many people in the space are focused on new tools for creators, instead of "automated creation" - what is "cool/interesting/listenable" is ill-defined, but making something which allows creators to explore the sonic landscape in really different ways seems a lot more plausible to me - more "weird synthesizers with neural networks" than "robo-artists".

Theory turns out to be nowhere close to sufficient for defining musicality.

If you write a rule-based counterpoint solver - this has been done, with varying levels of success - you'll find that not only do you have to include an extra set of rules that aren't defined in any of the standard texts, but that the best output you can expect is musically mediocre.

The other approach is to create a patchwork of idiomatic cliches. That usually sounds more convincing, but still isn't musically interesting. And you can usually hear where the edges are glued together.

It turns out that there is no theory of "good" music. It literally doesn't exist. All the standard texts for each style - and it doesn't matter which style you pick, from Palestrina to pop - are very incomplete guides that rely on human intelligence and creativity to fill in the gaps.

Throwing neural networks at this problem doesn't make it any easier, because no one knows what to look for.

Features in artistic style transfer are easy to parse - shape, texture, and that's pretty much it, all packaged in a ready-to-go 2D distribution.

What are the musical features that define not just one possible musical style, but all of them, and would allow anyone to morph smoothly from Wagner to Taylor Swift to Balinese Gamelan to DubStep?

On rule-based solvers - this is true if you simply stop at "be within the rules" and do strict constraint solving à la the coloring problem.

I think there is potential to do something by blending all 3 (rules and constraint checking, rewards (perhaps based on idioms, riffs, and cliches), and some additional discriminator/critic/metric learning) if you narrow the criteria sufficiently (3-voice counterpoint in the style of Josquin des Prez, for example).

I would really be pleased to see something that can fill out "theory 101" counterpoint worksheets against a cantus firmus. Even if the results would be graded poorly on "style", it is someplace to start getting feedback and collecting data to try to quantify that "it" factor for a very narrow subsection of the wide world of music.
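As a sketch of what the "rules and constraint checking" layer might look like for such a worksheet, here is a deliberately tiny note-against-note checker. It assumes the counterpoint lies above the cantus firmus, uses MIDI note numbers, and covers only two textbook rules (consonant vertical intervals, no parallel perfect intervals); the function name, rule set and example lines are all simplifications made up for illustration.

```python
# Interval classes (semitones mod 12) treated as consonant in this sketch:
# unison/octave, minor and major third, perfect fifth, minor and major sixth.
CONSONANT = {0, 3, 4, 7, 8, 9}
PERFECT = {0, 7}


def violations(cantus, counterpoint):
    """Return (position, rule) pairs for every broken rule."""
    problems = []
    for i, (low, high) in enumerate(zip(cantus, counterpoint)):
        interval = (high - low) % 12
        if interval not in CONSONANT:
            problems.append((i, "dissonance"))
        if i > 0:
            previous = (counterpoint[i - 1] - cantus[i - 1]) % 12
            both_moved = (cantus[i] != cantus[i - 1]
                          and counterpoint[i] != counterpoint[i - 1])
            # Same perfect interval twice in a row while both voices move.
            if both_moved and interval == previous and interval in PERFECT:
                problems.append((i, "parallel perfect interval"))
    return problems


cantus = [62, 65, 64, 62]            # D F E D
acceptable = [69, 69, 72, 74]        # fifth, third, sixth, octave
parallel_fifths = [69, 72, 71, 69]   # a fifth above throughout
```

Passing such a checker says nothing about whether the line is any good, which is the point made above: the rules prune the search space, and the "style" judgment still has to come from somewhere else.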

I have a very weak grasp of music theory and write a lot of music generators based on that weak grasp. They are mostly successful, compared to my prose generators.

It's not so much that I'm overestimating music theory, but that the state of really concrete analytic structures in other arts (in terms of being well-positioned to produce new works from PRNG output) is pretty bad.

It's pretty straightforward to construct a program that produces mediocre music that really resembles human-written mediocre music. Computers are also really good at producing poetry that is indistinguishable from the work of mediocre human poets. Computers are not good at producing images that look like they were hand-drawn by a 12-year-old DeviantArt user or producing stories that read like they were taken from fanfiction.net.

Generating rhythms is a problem that has been solved by arpeggiators, step sequencers, analog modular rigs and more sophisticated tools like KARMA. You don't need machine learning for it.

This is a greatly oversimplified view of rhythm. There are many rhythms and grooves that do not lock in with "the grid". These tools will most likely not produce a natural pattern of velocity that sounds appropriate for the generated rhythm. Step sequencers are a tool for inputting rhythm, not for generating it. Arpeggiators typically have a consistent rhythm (hitting on every one of some subdivision).

Modern step sequencers (for example, Elektron boxes) are way more than just a grid. They have microtiming, parametrized triggers, parameter sliding, probabilistic and conditional triggers, and are capable of running multiple patterns of varying lengths that reset at different rates.
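To give a feel for two of those features (probabilistic triggers and patterns of different lengths running against each other), here is a toy sketch. It is not modelled on any particular hardware; the function name, the track names and the probability values are invented, and microtiming, parameter slides and conditional triggers are left out.

```python
import random


def run_sequencer(tracks, steps, seed=0):
    """Advance all tracks in lockstep and collect which ones fire per step.

    Each track is a list of trigger probabilities. Tracks wrap around at
    their own length, so a 3-step pattern drifts against a 4-step one.
    """
    rng = random.Random(seed)
    timeline = []
    for step in range(steps):
        hits = []
        for name, pattern in tracks.items():
            probability = pattern[step % len(pattern)]
            if rng.random() < probability:
                hits.append(name)
        timeline.append(hits)
    return timeline


tracks = {
    "kick":  [1.0, 0.0, 0.0, 0.0],        # always on the first of four
    "hat":   [1.0, 0.5, 1.0],             # 3 steps, middle one is a coin flip
    "snare": [0.0, 0.0, 1.0, 0.0, 0.25],  # 5 steps with an occasional ghost hit
}
timeline = run_sequencer(tracks, steps=12, seed=1)
```

Because the three patterns have lengths 4, 3 and 5, the combined pattern only repeats every 60 steps even before the random triggers are taken into account.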

Ok then tell me how to generate a drum part 30% of the way from Buddy Rich to Neil Peart.

I think the issue with music theory is that it's actually an afterthought: a bunch of guidelines for reproducing something that is often first developed simply by pleasing subjective taste. Example: tonal harmony didn't really lend itself to the 'blues' Idom7-IVdom7-Vdom7 progression, but later jazz harmony was developed to accommodate it.

There are some fun examples of this sort of stuff on http://dadabots.com/ , which includes attempts to synthesize music in the style of The Beatles, Meshuggah and The Dillinger Escape Plan.

If you like the idea of taking a song and changing its genre, check out Postmodern Jukebox. They’re pretty brilliant.

Or Richard Cheese. Or Nouvelle Vague.

All lots of fun.

I find this surprising, from the simplistic (and probably naive) view that images are 2D signals while music is 1D.

"Style transfer" also rarely works for object level transfer - it is more pattern based (high frequency content is often the "style" that is enhanced and transferred). Really nice transfers in practice sometimes require the object level content in the images to be similar, c.f. [0][1]. And all of this is coupled with really heavy human curation (people don't normally show their bad outputs)!

In music the "style" is the content in some sense. For example, jazz has a very different "style" than classical, at many levels (key and tempo choice/mode choice/melodic intervals/motifs/amount of repetition of said motif/how it varies/harmonization and chord choice/global structure (AABA format)), and it isn't easy to separate what pieces make it "jazz" and what don't (what factors of variation matter).

The equivalent in images would be replacing objects as well as texture, to form a new image that is reminiscent of the original but also novel at multiple scales - think of the Simpsons "Last Supper" as the goal of a style transfer [2].

It is also hard because as consumers we are used to hearing high quality versions of these types of "style transfer" for some styles all the time - and we even have a name for it ... "muzak".

[0] https://raw.githubusercontent.com/awentzonline/image-analogi...

[1] https://github.com/chuanli11/CNNMRF

[2] http://s267.photobucket.com/user/wiro_bucket/media/last%20su...

I think "cover song" is a more generic term than "muzak" for musical style transfer.

It can be, though some covers are "straight up", while others (generally the memorable ones) are practically a new creation in themselves, with a sliding scale in between. But for "cover song" meaning something like Hendrix's "All Along the Watchtower" or Coltrane's "My Favorite Things", I agree.

Thanks for this detailed explanation. The Simpsons "Last Supper" analogy really made it click for me.

Even images, from an analysis perspective, are far more than 2D. Think of the problem space: the types of lines, shading, coloration, blending of colors, overall structure, etc. All of this must come into play with style transfer.

With music, it's much more complicated due not only to the time domain, but also to the emotional content, the interrelationships of rhythm, tempo, chord progressions, expression, timbre, transitions from the simple to the complex, the way the music is layered, and how all these things are perceived by the listener. Like human faces, music elicits an emotional response that runs deep, and whereas the emotional content of a picture is relatively static, music's emotional content is dynamic and evolving. Music has its own "uncanny valley" between the machine and the human.

Think of a MIDI track that is generated by precisely laying out notes and velocities on a piano roll vs. a track that is recorded live and captures the dynamics of the player. The difference is immediately obvious, and that is with a track composed by a human.

Sound (as we perceive it) is 2D by the time it reaches the brain. The cochlea works as a "discrete" Fourier transformer, with little hairs that resonate around certain frequencies. It is the fact that sound could be encoded in 1D that was, at one time, astounding.
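A crude software analogue of that 1D-to-2D step is a spectrogram: chop the signal into short frames and take the Fourier magnitude of each. The sketch below is only an illustration and not a model of the cochlea; the function name, frame size and test tone are arbitrary choices, there is no windowing or overlap, and the naive DFT is used purely to avoid dependencies.

```python
import cmath
import math


def spectrogram(signal, frame_size, hop):
    """Turn a 1D signal into a 2D time-by-frequency grid of magnitudes.

    Rows are successive frames, columns are frequency bins from 0 up to
    frame_size // 2 (Nyquist). Rectangular window, magnitudes only.
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(v * cmath.exp(-2j * math.pi * k * t / frame_size)
                      for t, v in enumerate(frame))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Toy signal: a low tone followed by a tone one octave higher.
low = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
high = [math.sin(2 * math.pi * 8 * t / 32) for t in range(64)]
spec = spectrogram(low + high, frame_size=32, hop=32)
loudest_bin = [row.index(max(row)) for row in spec]
```

Reading `loudest_bin` frame by frame shows the energy jumping from one frequency bin to another halfway through, which is the kind of picture-like representation most audio ML models actually consume.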

Much more leeway for "unintended cheating" with 2D signals. Think of all the imperceptible patterns that could be used as distinguishers. ML methods basically search for these shortcuts.

Wouldn't music be 4D, as sound in space through time?

Only if you have a different track playing at every point in space.

Music isn't any more 1D than language is.

We perceive sound two-dimensionally by spectral decomposition in the cochlea, so there's that...

how about 2d in frequency domain?

"Music is NOT well-understood by machines (yet !!)"

I wrote this blog post, with some data that might help improve that.


Finally someone gets it. At least a little. :)
