Hacker News

All the AI music I’ve heard so far has a really unpleasant resonant quality to it. Why is that? Can it be removed?


I've done some work on AI audio synthesis, and the artifacts you're hearing in these clips come from the algorithm used to go from the synthesized spectrogram back to audio: the Griffin-Lim algorithm.

Audio spectrograms have two components: the magnitude and the phase. Most of the information and structure is in the magnitude spectrogram, so neural nets generally synthesize only that. A phase spectrogram, by contrast, looks completely random, and neural nets have a very, very difficult time learning to generate good phases.

When you go from a spectrogram to audio you need both the magnitudes and phases, but if the neural net only generates the magnitudes you have a problem. This is where the Griffin-Lim algorithm comes in. It tries to find a set of phases that works with the magnitudes so that you can generate the audio. It generally works pretty well, but tends to produce that sort of resonant artifact that you're noticing, especially when the magnitude spectrogram is synthesized (and therefore doesn't necessarily have a consistent set of phases).
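A minimal numpy sketch of the Griffin-Lim idea (my own illustration with an ad-hoc STFT; real implementations, e.g. librosa's `griffinlim`, differ in details): it alternates between inverting the spectrogram and re-analyzing the result, keeping the target magnitudes and only updating the phases each round.

```python
import numpy as np

def stft(x, win, hop):
    """Complex spectrogram: windowed frames of x, FFT'd. Shape (freq, time)."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T

def istft(spec, win, hop):
    """Overlap-add inverse with window-square normalization."""
    frames = np.fft.irfft(spec.T, axis=1)
    n = frames.shape[1]
    out = np.zeros((frames.shape[0] - 1) * hop + n)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n] += f * win
        norm[i * hop:i * hop + n] += win ** 2
    # the floor avoids blow-up where the window taper makes norm tiny
    return out / np.maximum(norm, 1e-3)

def griffin_lim(mag, win, hop, n_iter=50):
    """Iteratively search for phases consistent with a magnitude spectrogram."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, win, hop)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(mag * phase, win, hop)

# usage: resynthesize a 440 Hz tone from its magnitude spectrogram alone
win, hop = np.hanning(512), 128
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 8000)
mag = np.abs(stft(tone, win, hop))
resynth = griffin_lim(mag, win, hop)
```

The magnitudes here come from a real signal, so the iteration converges to something consistent; when the magnitudes are synthesized by a neural net, no fully consistent set of phases may exist, which is where the artifacts come from.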

There are other ways of using neural nets to synthesize the audio directly (Wavenet being the earliest big success), but they tend to be much more expensive than Griffin-Lim. Raw audio data is hard for neural nets to work with because the context size is so large.


Phase is critical for pitch. Here is why. The spectral transformation breaks the signal up into frequency bins, and the bins are not fine-grained enough to convey pitch accurately. When a periodic signal is put through an FFT, it lands in a particular frequency bin. Say the frequency of the signal is right in the middle of that bin: if you vary its pitch a little bit, it will still land in the same bin. Knowing the amplitude of the bin doesn't give you the exact pitch, and the phase of a single frame won't give it to you either. However, between successive FFT frames, the phase rotates. The more off-center the frequency is, the faster the phase rotates. If the signal is dead center, each successive FFT frame shows the same phase. When it is off center, the waveform shifts relative to the window, so the phase changes with every frame. From the rotating phase, you can determine the pitch of that signal with great accuracy.


Yes, this is exactly right and is why Griffin-Lim generated audio often has a sort of warbly quality. If you use a large FFT you can mitigate the issues with pitch because the frequency resolution in your spectrogram is higher, so the phase isn't so critical to getting the right pitch. But the trade-off of a bigger FFT is that the pitches now have to be stationary for longer.

The other place where phase is critical is in impulse sounds like drum beats. A short impulse is essentially just energy over a broad range of frequencies, but the phases have been chosen such that all the frequencies cancel each other out everywhere except for one short duration where they all add constructively. Without the right phases, these kinds of sounds get smeared out in time and sound sort of flat and muffled. The typing example on their demo page is actually a good example of this.
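A quick numpy illustration of the smearing: keep an impulse's magnitude spectrum exactly, but randomize its phases, and the click spreads out over the whole frame.

```python
import numpy as np

n = 1024
click = np.zeros(n)
click[100] = 1.0                     # a one-sample impulse: flat magnitude spectrum
spec = np.fft.rfft(click)

rng = np.random.default_rng(0)
random_phase = np.exp(2j * np.pi * rng.random(spec.shape))
smeared = np.fft.irfft(np.abs(spec) * random_phase, n)

# identical magnitudes, but the sharp peak is gone: energy is spread in time
peak_ratio = np.max(np.abs(smeared)) / np.max(np.abs(click))
```

The total energy is roughly unchanged, but the peak amplitude collapses, which is exactly the flat, muffled character of a phase-mangled drum hit.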


So what is phase? From dabbling with waveforms in audio editors, sampling, and later learning a little bit about complex numbers, phase seems ultimately equivalent to what would sound like changing pitch: modulating the frequency of a periodic signal.

The simplest demonstration of it is the Doppler shift. But it's not at all that simple, because when you move relative to the source, the sound pressure and thus the perceived loudness also change, distorting the waveform and thereby introducing resonant frequencies. Now imagine that the transducer is always moving, e.g. a plucked string.

The ideal harmonic pendulum swings periodically, losing amplitude only to attenuation. But a resonant transducer picks up reflections of its own signal, like coupled pendulums, which are intractable in the same way as the three-body problem.

On top of that, our hearing is finely tuned to voices and to qualities of noise.


Phase is the offset in time. The functions sin(θ) and sin(θ + c), for arbitrary real c, represent the same frequency signal; they are offset from each other horizontally by c, and that c is a phase difference. It has an interpretation as an angle, when the full cycle of the wave is regarded as degrees around a circle; and that's what I mean by rotating phase.

When you take a window of samples of a signal, and run the FFT on it, for every frequency bin the calculation determines the amplitude and phase of the signal. If you have a frequency bin whose center is 200 Hz, and there is a 200 Hz signal, then what you get for that frequency bin is a complex number. The complex number's magnitude ("modulus") is the amplitude of that signal, and its angle ("argument") is the phase.

If the signal is exactly 200 Hz, and if the successive FFT windows move by a multiple of 1/200th of a second, then the phase will be the same in successive FFT windows.

But suppose that the signal is actually 201 Hz: a little faster. Then with each successive FFT window, the phase will not line up any more with the previous window; it will advance a little bit. We will see a rotating complex value: same modulus, but the angle advancing.

From how fast the angle advances relative to the time step between FFT windows, we can deduce that we are capturing a 201 Hz signal in that bin (on the hypothesis that we have a pure, periodic signal in there).
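This rotating-phase trick (the core of the phase vocoder) can be sketched with a couple of FFTs. The numbers below are my own illustration: a 201 Hz tone analyzed in a 200 Hz-centered bin, with the frequency recovered from the phase advance between two windows.

```python
import numpy as np

fs = 8000
n = 400                  # window length: bin width fs/n = 20 Hz, so bin 10 is centred at 200 Hz
hop = 200                # step between successive windows
f_true = 201.0           # a tone slightly above the bin centre
t = np.arange(n + hop) / fs
x = np.sin(2 * np.pi * f_true * t)

k = 10                   # the 200 Hz bin
p1 = np.angle(np.fft.rfft(x[:n])[k])
p2 = np.angle(np.fft.rfft(x[hop:hop + n])[k])

# phase advance a bin-centre (200 Hz) tone would show over `hop` samples
expected = 2 * np.pi * 200 * hop / fs
delta = (p2 - p1 - expected + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
f_est = 200 + delta * fs / (2 * np.pi * hop)
```

A dead-centre 200 Hz tone would give `delta` of zero; the extra rotation pins the frequency down far more finely than the 20 Hz bin width.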

How is the phase determined in the frequency bin? It's basically a vector correlation: a dot product. The samples are a vector which is dot-producted with a complex unit vector. The complex unit vector in the 200 Hz bin is essentially a 200 Hz sine and cosine wave, rolled into a single vector with the help of complex numbers. Sine and cosine are 90 degrees apart in phase, so they form a rectilinear basis (coordinate system). The calculation projects the signal, expressing it as a sum of the sine and cosine vectors. How much of one versus the other is the phase. A signal that is 100% correlated with the sine will have a phase angle of 0 degrees or possibly 180. If it correlates with the cosine component, it will be 90 or 270. Or some mixture thereof.
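A concrete illustration of that sine/cosine correlation (numbers of my own choosing): correlate a tone carrying a known phase offset against the two basis waveforms, and the mixture of the two recovers the offset.

```python
import numpy as np

N, k = 400, 10
n = np.arange(N)
x = np.sin(2 * np.pi * k * n / N + 0.7)    # tone in bin k with a 0.7 rad phase offset

c = x @ np.cos(2 * np.pi * k * n / N)      # correlation with the cosine basis vector
s = x @ np.sin(2 * np.pi * k * n / N)      # correlation with the sine basis vector
phase = np.arctan2(c, s)                   # the mix of the two recovers the offset
```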

Because a complex number is two real numbers rolled into one, it simplifies the calculation: instead of doing a dot product with a sine vector and a cosine vector to separately correlate the signal to the two coordinate bases, the complex numbers do it in one dot product operation. When we go around the unit circle, each position on the circle is cos(θ) + i·sin(θ). These complex values give us samples of both functions. Exactly such values are stuffed into the rows of the DFT matrix: complex values from the unit circle divided into equal divisions.

If you look here at the definition of the ω (omega) parameter:

https://en.wikipedia.org/wiki/DFT_matrix

It is the N-th complex root of unity. But what that really means is that it is a 1/N-th step of the way around the unit circle. For instance, if N happened to be 360, then ω is the complex number with |ω| = 1 (a unit vector) whose argument is 1 degree: one degree around the circle. The second row of the DFT matrix has 1, ω, ω², ω³, ...; it represents the lowest frequency (after zero, which is the first row). It captures a single cycle of a sine and cosine waveform in N samples. The values in that row step around the unit circle in the smallest increment, so they go around the circle exactly once. The subsequent rows go around the circle in skipped steps, yielding higher frequencies: 1, ω², ω⁴, ... for twice around the circle; 1, ω³, ω⁶, ... for three times, and so on, giving all the harmonics up to the resolution N allows.
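A quick numpy check of that construction (note numpy's FFT steps around the circle in the negative direction, i.e. ω = exp(-2πi/N), matching the Wikipedia DFT-matrix convention):

```python
import numpy as np

N = 8
omega = np.exp(-2j * np.pi / N)        # N-th root of unity: one 1/N step around the circle
k, m = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
W = omega ** (k * m)                   # row k goes around the circle k times over N samples

x = np.random.default_rng(1).standard_normal(N)
assert np.allclose(W @ x, np.fft.fft(x))   # the matrix product reproduces the FFT
```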


> on the hypothesis that we have a pure, periodic signal in there

That pure sine wouldn't generate any artefacts. It would still come out of the AI as 200 Hz even if it throws the phase information away. You wouldn't hear a difference unless it's an (aptly so called) complex signal. E.g. 200 and 201 Hz layered is an impure signal whose combined period is a full second (a 1 Hz beat), far longer than the analysis window. At the beat minimum the two signals cancel out completely. [1]

The important point is, I think, that FFT doesn't simply look at the offset aka phase. Rather, 201 Hz looks like a 200 Hz that is moving. So it encodes phase-shift in the delta of the offset between two windows. For a sum of 200 and 201 Hz it has to assume that the magnitude is also changing, which I find entirely counterintuitive.
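That changing magnitude is easy to see numerically (illustrative numbers of my own): track the 200 Hz bin of a 200 + 201 Hz mixture across successive windows, and its magnitude rises and falls with the 1 Hz beat.

```python
import numpy as np

fs, n, hop = 8000, 400, 400
t = np.arange(fs) / fs                # one second of audio
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 201 * t)  # beats once per second

# magnitude of the 200 Hz bin (bin 10, with fs/n = 20 Hz bins) in successive windows
mags = [np.abs(np.fft.rfft(x[i:i + n])[10]) for i in range(0, fs - n + 1, hop)]
```

Near the beat null the two components arrive in the bin in opposite phase, so the magnitude dips toward zero even though neither tone ever changes amplitude.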

From the mathematical perspective, this seems like a boring homework problem, far detached from acoustics. So, I don't know. The funny thing is that rotation is very real in the movement of strings. If the orbit at one point is elliptic, that's like two sinusoids at different magnitudes offset by some 90 degrees, in a simplified model. But the string has nearly infinite coupled points along its axis. As they excite each other, and each point has a different distance to the receiver, that's where phase shift happens.

> If you look here at the definition of the ω (omega) parameter

I wasn't going to make drone, but I will take a look.

1: https://graphtoy.com/?f1(x,t)=100*sin(x)&v1=true&f2(x,t)=100...


I wonder if this could be improved by using the Hartley transform instead of the Fourier transform.


Considering Stable Diffusion generates 3-channel (RGB) images, maybe it would be possible to train it on amplitude and phase data as two different channels?


People have tried that, but the model essentially learns to discard the phase channel because it is too hard for it to learn any useful information from it.


Got any citations... that sounds like a fascinating thing to read about.


We took a look at encoding phase, but it is very chaotic and looks like Gaussian noise. The lack of spatial patterns is very hard for the model to generate. I think there are tons of promising avenues to improve quality though.


Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up like it should across frequencies at transients. Maybe something the model could grab hold of better is phase discontinuity (deviation from the expected phase based on the previous slices) or relative phase between peaks, encoded as colour?

But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.
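A rough sketch of that post-processing idea (the flux threshold and the impulse-style phase reset are my own arbitrary choices, not a tested method): flag frames where the magnitude spectrum changes fast, then give those frames the linear phase of an impulse so all frequencies add constructively at one instant.

```python
import numpy as np

def sharpen_transients(mag, phase, threshold=2.0):
    """Where spectral flux is high, replace that frame's phases with the
    linear phase of an impulse at the frame centre."""
    # positive spectral flux per frame (0 for the first frame)
    flux = np.concatenate(([0.0], np.maximum(np.diff(mag, axis=1), 0).sum(axis=0)))
    hits = flux > threshold * (np.median(flux) + 1e-9)

    n_bins = mag.shape[0]
    n_fft = 2 * (n_bins - 1)
    k = np.arange(n_bins)
    impulse_phase = -2 * np.pi * k * (n_fft // 2) / n_fft  # impulse mid-frame

    out = phase.copy()
    out[:, hits] = impulse_phase[:, None]
    return out
```

Whether hard phase resets sound better than the smeared original would need listening tests; they could also introduce their own discontinuity artifacts at frame boundaries.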


That makes a lot of sense, I would be keen to see attempts at that.


I'm curious why, instead of using magnitude and phase, you wouldn't use real and imaginary parts?


There have been some attempts at doing this, some of which have been moderately successful. But fundamentally you still have the problem that from the NN's perspective, it's relatively easy for it to learn the magnitude but very hard for it to learn the phase. So it'll guess rough sizes for the real and imaginary parts, but it'll have a hard time learning the correct ratio between the two.

Models which operate directly on the time domain have generally had a lot more success than models that operate on spectrograms. But because time-domain models essentially have to learn their own filterbank, they end up being larger and more expensive to train.


I wonder if there might be room for a hybrid approach, with a time-domain model taking machine-generated spectrograms as input and turning them into sound. (Just a thought, no idea whether it actually makes sense.)


Would it be an approach to use separate color channels for the frequency amplitude and the frequency phase in the same picture? Maybe the network would then have a better way of learning the relationships, and there would be no need for postprocessing to generate a phase.


RAVE attacks the phase issue by using a second step of training. I don't completely understand it, but it uses a GAN architecture to make the outputs of a VAE sound better.


Griffin-Lim is slow and is almost certainly not being used.

A neural vocoder such as Hifi-Gan [1] can convert spectra to audio - not just for voices. Spectral inversion works well for any audio domain signal. It's faster and produces much higher quality results.

[1] https://github.com/jik876/hifi-gan


If you check their about page they do say they're using Griffin-Lim.

It's definitely a useful approach as an early stage in a project since Griffin-Lim is so easy to implement. But I agree that these days there are other techniques that are as fast or faster and produce higher quality audio. They're just a lot more complicated to run than Griffin-Lim.


Author here: Indeed we are using Griffin-Lim. Would be exciting to swap it out with something faster and better though. In the real-time app we are running the conversion from spectrogram to audio on the GPU as well because it is a nontrivial part of the time it takes to generate a new audio clip. Any speed up there is helpful.


I think this is because the generation is done in the frequency domain. Phase retrieval is based on heuristics and isn't perfect, so it leads to this "compressed audio" feel. I think it should be improvable.


The link is down now, so I don't know about this one. But most generated music is generated in the note domain, rather than the audio domain. Any unpleasant resonance would be introduced in the audio synthesis step. And audio synthesis from note data is a very solved problem for any kind of timbre you can conceive of, and some you can't.


You're probably talking about the artifacts of converting a low resolution spectrogram to audio.


Can the spectrogram image be AI upscaled before transforming back to the time domain?


Yes it exists: https://ccrma.stanford.edu/~juhan/super_spec.html

But the issue is not that the spectrogram is low quality.

The issue is that the spectrogram only contains the amplitude information. You also need phase information to generate audio from the spectrogram.


Interesting, can't you quantize and snap to a phase that makes sense to create the most musical resonance?


What happens if you run one of the spectrogram pictures through an upscaler for images like ESRGAN ?


It sounds kind of like the visual artifacts that are generated by resampling in two dimensions. Since the whole model is based on compressing image content, whatever it's doing DSP-wise is more-or-less "baked in", and a probable fix would lie in doing it in a less hacky way.


The first ever recordings had people shouting to get anything to register. They sounded like tin. Fast forward to today.

Looking back at image generation just a year or two ago and people would have said similar things.

Not hard to imagine the trajectory of synthesized audio taking a similar path.


Presumably for similar reasons that the vast majority of AI generated art and text is off-puttingly hideous or bland. For every stunning example that gets passed around the internet, thousands of others sucked. Generating art that is aesthetically pleasing to humans seems like the Mt. Everest of AI challenges to me.


I think your comment is off-topic to the post you are replying to. That wasn't asking about the general aesthetic quality - more about a specific audio artifact.

> For every stunning example that gets passed around the internet, thousands of others sucked.

From personal experience this is simply untrue. I don't want to debate it because you seem to have strong feelings about the topic.


Even if you remove the artifact, the exact same comment applies. It generates a somewhat less interesting version of elevator music. This is not to crap on what they did. As I said, the underlying problem is extremely difficult and nobody has managed to solve it.

I don't feel strongly about this topic at all.


> It generates a somewhat less interesting version of elevator music.

This iteration does, but that's an artifact of how it's being generated: small spectrograms that mutate without emotional direction (by which I mean we expect things like chord changes and intervals in melodies that we associate with emotional expressions - elevator music also stays in the neutral zone by design).

I expect with some further work, someone could add a layer on top of this that could translate emotional expressions into harmonic and melodic direction for the spectrogram generator. But maybe that would also require more training to get the spectrogram generator to reliably produce results that followed those directions?


The vast majority of human generated art is hideous or bland. Artists throw away bad ideas or sketches that didn't work all the time. Plus you should see most of the stuff that gets pasted up on the walls at an average middle school.


Hard disagree. The average middle school picture will have certain aspects exaggerated, giving you insight into the mind's eye of the creator: how they see the world, what details they focus on. There is no such mind's eye behind AI art, so it's incredibly boring and mundane, no matter how good a filter you apply on top of its fundamental lack of soul, or of anything interesting to observe beyond the surface level. It's great for making assets for businesses to use; it's almost a perfect match, since they are looking for no controversial soul in the assets they use, just lots of pretty bubblegum polish.


Perhaps most of the AI art out there (that honestly represents itself as such) is boring and mundane, but after many hours exploring latent space, I assure you that diffusion models can be wielded with creativity and vision.

Prompting is an art and a science in its own right, not to speak of all the ways these tools can be strung together.

In any case, everything is a remix.


I have to agree; the act of coming up with a prompt is one and the same as providing "insight into the mind's eye of the creator, how they see the world, what details they focus on" - two people will describe the same scene with completely different prompts.


And the vast majority of professionally produced artwork is for business use. It’s packaging design or illustration or corporate graphics or logos or whatever.

I don’t get the objection.


> For every stunning example that gets passed around the internet, thousands of others sucked

…implying there may be an art to AI art. Hmm.

Meanwhile, the degree to which it is off-puttingly hideous in general can be seen in the popularity of Midjourney — which is to observe millions of folks (of perhaps dubious aesthetic taste) find the results quite pleasing.


Not sure about this. Models like Midjourney seem to put out very consistently good images.



