Making a Neural Synthesizer Instrument (tensorflow.org)
147 points by jessejengel 6 days ago | 44 comments

This is interesting, although the results are underwhelming IMO. I actually had the same idea - finding the latent space of instrument sounds and using that for synthesis - a couple of years back. After countless hours of research I managed to turn it into a commercial software instrument called "GalaXynth" [1]. For me at least, it turned out that the "automatic" latent space (i.e. discovered by autoencoding) isn't that interesting musically, so I turned to hand-designing the latent space, which was a gargantuan task that I'm not sure I would take on again knowing how hard it is. Anyway, if anyone is interested in this type of thing, get in touch!

[1] https://heartofnoise.com/products/galaxynth/

Looks like this project made similar trade-offs:

> Because the WaveNet decoder is computationally expensive, we had to do some clever tricks to make this experience run in real-time on a laptop. Rather than generating sounds on demand, we curated a set of original sounds ahead of time. We then synthesized all of their interpolated z-representations. To smooth out the transitions, we additionally mix the audio in real-time from the nearest sound on the grid. This is a classic case of trading off computation and memory.
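A rough sketch of the lookup-and-mix idea in the quoted passage, not the project's actual code. Everything here is an assumption for illustration: the function names, the square grid, the bilinear weighting over the four surrounding cells, and the sine waves standing in for pre-rendered decoder output.

```python
import math

def render_grid(size, n_samples=64):
    """Stand-in for the offline step: one pre-rendered buffer per grid cell.
    A real system would store decoder output here; this sketch fills each cell
    with a sine wave whose frequency depends on the cell (placeholder data)."""
    grid = {}
    for i in range(size):
        for j in range(size):
            freq = 1 + i + size * j
            grid[(i, j)] = [math.sin(2 * math.pi * freq * t / n_samples)
                            for t in range(n_samples)]
    return grid

def mix_at(grid, size, x, y):
    """Bilinearly mix the four pre-rendered buffers around (x, y) in [0, 1]^2.
    Requires size >= 2."""
    gx, gy = x * (size - 1), y * (size - 1)
    i0, j0 = min(int(gx), size - 2), min(int(gy), size - 2)
    fx, fy = gx - i0, gy - j0
    corners = [
        ((i0, j0), (1 - fx) * (1 - fy)),
        ((i0 + 1, j0), fx * (1 - fy)),
        ((i0, j0 + 1), (1 - fx) * fy),
        ((i0 + 1, j0 + 1), fx * fy),
    ]
    n = len(grid[(0, 0)])
    return [sum(w * grid[c][t] for c, w in corners) for t in range(n)]
```

At run time only `mix_at` is called, so the cost per sample is four multiply-adds no matter how expensive the offline rendering was. That is the computation-for-memory trade the quote describes.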

Just wanted to say that I admired GalaXynth when it came out. I still have the demo and mean to give it another go sometime. As you're probably well aware, the justifications to buy a synth are hard to pin down, and if you made a second run at the concept with a focus on fleshing out "traditional" features, I bet there would be some breakthroughs.

Thanks, appreciate the feedback! Others are pointing me in the same directions, so I'm working on adding stuff like filters to push it more in the "canonical" direction.

Pretty cool. Thanks for investing time in this and producing the videos.

The synth performer in me is longing for a traditional ADSR-type interface, so that I can emphasize the attack and decay of a bowed-string instrument, sustain like a reed, and release like a horn.

Perhaps that's achievable with GalaXynth?

Yes, that was the main feedback I got after releasing it and since version 1.1 there is a modulation section where any parameter can be modulated with envelopes and LFOs, including what sound is played. I never got around to making a video about it, though.
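For readers who haven't met the ADSR envelope mentioned above, here is a minimal sketch of a linear one. The function name and parameters are hypothetical and say nothing about how GalaXynth's modulation section is implemented.

```python
def adsr(t, attack, decay, sustain, release, note_off):
    """Linear ADSR gain at time t (seconds).
    `sustain` is a level in [0, 1]; attack, decay and release are durations.
    Assumes the key is released at `note_off` >= attack + decay."""
    if t < 0:
        return 0.0
    if t < attack:                      # ramp 0 -> 1
        return t / attack
    if t < attack + decay:              # ramp 1 -> sustain level
        return 1.0 - (1.0 - sustain) * (t - attack) / decay
    if t < note_off:                    # hold while the key is down
        return sustain
    if t < note_off + release:          # ramp sustain level -> 0
        return sustain * (1.0 - (t - note_off) / release)
    return 0.0
```

Multiplying an oscillator's output by `adsr(...)` shapes loudness. Routing the same curve to another parameter, such as a blend position, is the kind of modulation described in the comment above.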

This is such a cool idea! I love the UI you came up with for blending sounds. It's so immediately obvious how it works after seeing/hearing the demo vid. Super awesome.

wow man this looks awesome. and like an insane amount of work. thanks for sharing!

No Linux version? :-(

It's on my todo-list! But as I understand it from other devs, the market for commercial audio software on linux is pretty small.

You could probably host the Windows VST with Wine.

I mean, it's a cool project, but I'm not sure the latent space between statistically learned sound representations is that interesting for music... For speech synthesis I can see the applications.

For musical sound generation I think there is still room for new discovery with physically modeling sounds, like these guys are doing - http://www.ness.music.ed.ac.uk/. You can morph between physically modeled instrument sounds too, and it's all realtime. Complex physical models tend to have too many parameters for human performers to control so maybe neural networks could be used there to learn to control the params and their interaction in a meaningful way for musical sound production.

Also, since WaveNet is so computationally intensive, training and reconstructing sounds sample by sample at 16k samples per second and 8 bits turns out to be pretty heavy going - let alone 44 kHz/16-bit. So to make it realtime, this implementation is basically interpolating in a grid of pre-rendered wavetables to morph between instruments, right?
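To put rough numbers on the comparison above, here is a back-of-the-envelope sketch. It assumes the model emits one categorical prediction per sample (so the bit depth sets the number of output classes) and takes 44.1 kHz as the CD-quality rate. The helper name is hypothetical.

```python
def sequential_cost(sample_rate_hz, bit_depth):
    """Per-second cost proxies for a sample-by-sample autoregressive model.
    Assumption of this sketch: one categorical output per sample."""
    return {
        "steps_per_second": sample_rate_hz,        # sequential model evaluations
        "classes_per_step": 2 ** bit_depth,        # size of the output distribution
        "bits_per_second": sample_rate_hz * bit_depth,
    }

low = sequential_cost(16_000, 8)
cd = sequential_cost(44_100, 16)

print(cd["steps_per_second"] / low["steps_per_second"])    # 2.75625
print(cd["classes_per_step"] // low["classes_per_step"])   # 256
```

So CD-quality output would need about 2.76 times as many sequential steps per second, each choosing among 256 times as many values, before counting any growth in the network itself.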

I think Yamaha made a physics based synthesizer. Only a few were produced because they were expensive and hard to master. Can't remember the name though.

The idea is cool, but the execution is lacking. My expectation is that if you turn the slider fully to one or the other side, you would get a perfectly clear representation of that instrument. In reality, you get this dirty synthetic sound no matter what and the result of blending them together is always "dirty" and they all sound similar in the end.

I understand that it is hard to encode the sound into these parameters and get near perfect decoding, but maybe that's the next step?

I think they just compressed too much due to computational constraints. They say that themselves. However, there is always a question of rate control in these autoencoder methods, and also the error function. In the original paper they don't seem to use a very good perceptual error function.

I had the same feeling. For instance, the sitar sound is not comparable to a sampled one. I think this idea makes sense for real synth sounds and not for imitations of already existing instruments.

Biggest missing part of the article is the "Music" promised in the title.

In other words: I was hoping to hear a musical composition or performance using samples generated by WaveNet.

I think a better, more realistic title for the article would be something like "Generating Experimental Sounds Using WaveNets". Even the authors' inclusion of animal sounds in the generation made this more of a novelty and not really useful for actual musical sound generation.

Would be interesting to compare this approach with other "Less neural-net-y" approaches, like combining samples in the frequency domain and mixing samples over time.

We reverted the submitted title “Make music with WaveNets” to that of the article.

Wow, thanks

'Neural audio synthesis' really misses the point. This is just the same technique applied to emulating real instruments, which is already a massively well solved problem, and unlikely to result in any significant improvement.

Music synthesis, on the other hand, is an unsolved hard problem. For example, producing harmonic accompaniment to a hummed melody, or rendering a given chord sequence in different musical styles, or making a drummer that can pick up where someone's beatboxing left off... any of these have the potential to reach a huge audience.

Synthesizing real instruments basically says 'we know nothing about this space and are going to ignore all the work that's already been done in it.'

> This is just the same technique applied to emulating real instruments, which is already a massively well solved problem

I disagree. Most instrument synthesis techniques that I've heard don't yet approach real physical instruments. I say let them try applying these techniques to an old problem and see if they can get any improvements.

Automated musical expression and improvisation are also active areas of research. (One interesting example: https://www.youtube.com/watch?v=qy02lwvGv3U - though I wonder how much it is listening and responding vs how much is pre-programmed.)

Inability to get a particular sound you want in a well-equipped audio production studio is not a problem in 2017.

I have never heard an emulation of a physical instrument that matched the real thing. For instance, movie studios hire musicians to fill in full orchestras for that reason. If they could get the same result without the musicians, I'm sure they would.

".. or rendering a given chord sequence in different musical styles"

Shameless plug for my RNN trained piano music generator which can do exactly that:


It can also generate novel chord sequences and compose to those.

Seems like a neat-ish idea, but the resulting samples are very meh. Not high fidelity sound.

If you really want to see the interesting stuff going on with software synthesizers, I'd strongly suggest checking out Steve Duda's Serum[1] wavetable synthesizer, or Sonic Charge's Synplant[2], a "genetic" synthesizer.

Steve Duda's elite master class is also super fascinating, if you've got the time! [3]

1: https://www.xferrecords.com/products/serum

2: https://soniccharge.com/synplant

3: https://www.youtube.com/watch?v=MOUkI5hH2HY

this is awesome. this is the kind of thing that spawns entire genres of music that didn't exist before. a lot of the demos sound terrible but as the modulation and drum examples allude to, it's going to be super weird and interesting when more disparate sources are interpolated. I can't wait to figure out how to "break" this.

If you think hard enough you can probably come up with 10 music tech projects that had similar awesome promise and never delivered on it. Really, it's not like we're short of ways to make new weird timbres or timbres that are oddly redolent of others... but if weird is all you need, you can just buy a modular and be half-way to outer space already. As you know, there's typically a vast waste of sonically uninteresting space in between the sweet spots - one reason I've become suspicious of synths whose primary claim is the broadness of the sound palette, because that promises endless tweaking for ultimately unsatisfying results.

i think what is appealing to me is the idea of playing unexpected combinations of sounds off one another. and it's the drum example in particular that really caught me.

you can certainly make weird sounds with existing synths, but interpolating rhythmic sound with a harmonic sound is different to me in that the resulting thing is more rooted in a musical context and can work with other non-neural elements more easily.

for example, once you get some sort of intuition for how sounds might meld, you could compose a "beat" made up of samples (maybe drum sounds, maybe not) in the "left" side that is tailored to interact in certain ways against the "right" (i'm referencing the UI in the Ableton video).

people might trade their "seed" sounds, or they might keep them close to the vest!

probably you could use Max/MSP to do stuff like this already but i'm imagining the "left" sound itself being thought of like an intuitive signal processing algorithm.

it's like second order sampling. you can find pieces of audio and, rather than use them directly, as today, you can create a third sound that probably can't be deconstructed back to the original.

might not birth a top-level genre like sampling did hip hop, but i think once someone puts it together the right way, and once processing power allows them to go beyond some of the limitations described, it will really open some new avenues

But you can do this already using tools like Tassman or one of the many spectral convolution/resynthesis tools. And if you have sufficient money to throw at analog or sufficient computing power to run a large digital modular tool like Reaktor (or any number of others), you can do so many wild things with bandpass filters and envelope followers as modulation sources. I have a Nord Modular sitting next to me and realistically that and a sampler offer more timbral possibilities than can be explored in a single human lifetime.

I don't want to just piss on this, of course any new technology is interesting. But synthesizing novel timbres is just not a big deal in 2017. 'Just imagine what's possible' is still a great marketing line, but anyone waiting on some new technology to make sounds that nobody has ever heard before is suffering from a failure of imagination rather than a limitation of technology.

Think of it this way: I could probably hop over to my local biohacking lab and find some way to map audio data onto DNA, modify it, and read it back out again using CRISPR. It would definitely be possible to encode audio information in DNA form. You know perfectly well it won't automatically give you more 'organic' or 'natural' sound despite the novel fact of doing the computation on a biological substrate, and you also know perfectly well that it would be marketed that way, just like almost every other synth is marketed on the basis of its wild creative possibilities.

It's like showing off your new graphic manipulation software with a picture resembling the Mona Lisa. You're selling some basic tools, but people are buying into the idea that having the tools will endow them with increased artistic ability. In reality everyone likes the new tool or filter you've come up with, it spreads rapidly to the point of over-familiarity, and then becomes fairly standard in future toolkits after the novelty has worn off.

Synthesized musical instruments never sound quite right (the best I've heard are from the Vienna Symphonic Library: https://www.vsl.co.at/en). While that doesn't appear to be the goal of this specific work, some of the WaveNet approaches seem like they could be used towards that end. Even if this requires rendering the audio for an instrument slower than real time, it would be a nice achievement if it can improve the quality. (Studio musician jobs I think are safe for quite a while still.)

Most instruments can make a wide range of different sounds, and players can move smoothly between the different sounds by playing the instrument in different ways.

This is really a kind of morphing. You can capture examples of each kind of sound with sampling, but you can't capture the performance morphing. Even if you could, there's no good way to perform the morphing with a typical synth keyboard, which only allows for velocity and maybe aftertouch - possibly poly AT for a handful of models.

So these huge sample sets have started using rule-based systems to try to add the morphing, or at least to make sample choices, in a context-sensitive way. This kind of works, up to a point, but it's not as good as the real thing.

As a side effect, sampling has driven jobbing composers, especially in Hollywood, towards an industry standard mechanical and repetitive orchestral sound.

It sounds orchestra-like, but it's a narrow and compressed version of all the colours an orchestra is capable of. If you compare it to the work of master orchestrators - Ravel, Stravinsky, Puccini - it's not hard to hear just how flat and colourless these scores are.

A good ML model of an orchestral instrument would be a very useful thing, because it would make it possible to think about breaking out of the sampling box. But there aren't enough people with enough of a background in both ML and music to make this likely.

Sadly, I think it's more likely we'll get even more compressed and narrow representations, with even more of the subtlety and expressive range removed.

Modern virtual instruments are capable of much more than what standard Hollywood soundtracks might make you think.

1) Performance morphing. We have moved from straightforward sampling to hybrid sampled/synthesized approaches. It will never be as good as the real thing, but it already allows for richer performances than what a boring player would do. Here is an example of a virtual clarinet (Sample Modeling Clarinet). I sequenced many variables separately to demonstrate: vibrato depth, vibrato speed, legato and portamento speed, growl, pressure and accent on the attack.


2) Extended techniques. Competition has encouraged virtual instrument publishers to go for the unusual stuff, and fill whatever niche hasn't been filled yet. For example I recently acquired a library specialized in extended cello technique (Jeremiah Pena Mystic). I used it in the soundtrack of a no-budget short film, here's an excerpt of the cello part:


Anyways, I agree that Hollywood soundtracks have been converging to standardized styles, and sampling may be to blame historically, but it is hardly a limiting factor anymore. If anything, it should now encourage creativity as it partly removes the fear of wasting massive resources when your experimental score ends up sounding like crap at the recording session.

It's not purely the sampling that's doing it to Hollywood: In some fashion, it's the heavily-derisked blockbuster formula to blame, and technology comes along for the ride.

Tony Zhou's Every Frame A Painting [0] offered a take on how the tendency is to work very closely to a temp track and then ask for something identical, which of course can only get you increasingly similar sounds. Dan Golding responded to this by adding some nuance, noting that temp tracks have always been in use, so the answer has to be a little more complicated, and he points back to the technology. [1] I would say that the technology is just a piece of the puzzle; you can order in a different type of sound and get it, whether or not you're using a computer-heavy approach. That's aptly demonstrated by the variety seen in indie games, for example. This is a problem that movies have made for themselves by being focused on fitting everything to a formula. The occasional film does slip through that has a great score that draws on something bigger than other films (for one example: Scott Pilgrim vs the World).

[0] https://www.youtube.com/watch?v=7vfqkvwW2fs

[1] https://www.youtube.com/watch?v=UcXsH88XlKM

Isn't Vienna Symphonic Library sampled instead of synthesized?

Yep. Use their soundfonts.

They should try to incorporate ideas of sound generation from samples which were established long ago, patented and then the area completely forgotten. Patents have since then expired.


This algorithm works wonderfully for guitars (there are improvements to it too). There was a synth two decades ago, if not more, I believe, that mimicked all of the wind instruments extremely well (you had a pipe through which you could blow and a keyboard to choose the sounds).

Unfortunately the whole area seems to be abandoned due to patents on algorithms.

It's wonderful that WaveNets are producing something this good, although the sound still needs improvement.

The list of software synths using and expanding on this algorithm piles up pretty high and is still rising. Perhaps you don't hear the name a lot because most of these plugins don't explicitly say so in their descriptions and marketing (although many do). The more common term is physical modeling, which basically implies Karplus-Strong and more advanced delay lines, waveguides, etc. as underlying algorithms. For example, Applied Acoustics Systems has been researching and publishing this kind of software synth for maybe 20 years. Native Instruments has also made a bunch of stuff clearly using Karplus-Strong, including patches for their popular Reaktor synth. Heck, I've made plugins using this algorithm myself.
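For anyone who hasn't seen it, the basic Karplus-Strong pluck named above fits in a few lines. This is a minimal textbook-style sketch, not code from any of the products mentioned. The function name, default damping value and seed are arbitrary choices for illustration.

```python
import random

def karplus_strong(freq_hz, duration_s, sample_rate=44_100, damping=0.996, seed=0):
    """Basic Karplus-Strong pluck: a noise burst circulating in a delay line,
    with a two-point average acting as a low-pass filter on every pass."""
    rng = random.Random(seed)
    period = max(2, int(sample_rate / freq_hz))      # delay-line length in samples
    line = [rng.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    for n in range(int(duration_s * sample_rate)):
        i = n % period
        out.append(line[i])
        # Overwrite the sample just read with the damped average of it and
        # the next sample in the line; high frequencies die out first.
        line[i] = damping * 0.5 * (line[i] + line[(i + 1) % period])
    return out

samples = karplus_strong(440.0, 0.5)   # half a second of a plucked A4
```

The delay-line length sets the pitch, and the averaging step is what turns noise into a decaying string-like tone. Waveguide models extend the same loop with better filters and fractional delays.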

I find pure physical modeling has been stagnating, though; hybrid approaches with sample-based synthesis seem more promising right now. This is what Sample Modeling has been doing and their results are impressive.

How does the performance of these algorithms hold up? I'm seeing all these old reports of how computationally difficult it is, yet I can't find any performance reviews on new hardware. Not to mention FPGAs or whatever else.

Lots of Karplus-Strong synths still around as Audiounits or VSTi's.

The wind synth was Yamaha VL1 - quite expensive at the time but very expressive. Not sure what happened to the patents on that, but there have been other physically modeled synths since then.

Stefan Bilbao's group at University of Edinburgh is carrying the flag on next-level physical modeling, though I think the funding for the project has come to an end unfortunately.


There was cool work on physically modeled instruments done in Finland by Vesa Välimäki et al, not sure what the state of that is now.

Thank you for the comments. I couldn't find any newer papers on the subject that advanced the methods.

Yes, it was VL1.

If you take synth sounds as inputs you can just mix the parameters without any neural net and get the same result...

Sadly does not sound new at all.

Check out the artist TCF. He claims his songs are all algorithmically generated using neural nets. Wicked sounds.

Very disappointed I can't blend cats and dogs

This is a little over my head. Could you please explain how to use it?
