Hacker News
Highres Spectrograms with the DFT Shift Theorem (soundshader.github.io)
124 points by gbh444g 9 days ago | 69 comments

It's unfortunate that the article doesn't get into the fundamental limits of spectrogram resolution, which stem from the famous uncertainty principle (https://en.wikipedia.org/wiki/Fourier_transform#Uncertainty_...). There is a fundamental tradeoff between frequency resolution and time resolution, similar to the position/momentum tradeoff in quantum mechanics. The Continuous Wavelet Transform, which is alluded to in the article, is a way to tune that tradeoff per frequency bin to best align with human sound perception.
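To see the tradeoff concretely: a Gaussian pulse achieves the theoretical minimum time-bandwidth product, sigma_t · sigma_f = 1/(4π). A small numpy sketch (the pulse width and sample spacing are arbitrary choices):

```python
# Numerical check of the time-frequency uncertainty product. A Gaussian
# pulse achieves the theoretical minimum sigma_t * sigma_f = 1/(4*pi).
import numpy as np

n = 1 << 16
dt = 1e-4                        # sample spacing, seconds (arbitrary)
t = (np.arange(n) - n / 2) * dt
sigma = 5e-3                     # pulse width, seconds (arbitrary)
x = np.exp(-t**2 / (2 * sigma**2))

def rms_width(axis, density):
    p = density / density.sum()  # treat |.|^2 as a probability mass
    mean = (axis * p).sum()
    return np.sqrt(((axis - mean)**2 * p).sum())

sigma_t = rms_width(t, np.abs(x)**2)
f = np.fft.fftshift(np.fft.fftfreq(n, d=dt))
X = np.fft.fftshift(np.fft.fft(x))
sigma_f = rms_width(f, np.abs(X)**2)

print(sigma_t * sigma_f)         # ~0.0796 = 1/(4*pi), the lower bound
```

Any other pulse shape (or any windowed sinusoid) gives a strictly larger product, which is exactly why sharpening a spectrogram in one axis blurs the other.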

> there is a fundamental tradeoff between frequency resolution and time resolution

I've always found it interesting that while that's fundamentally true in terms of information, my understanding is that we perceive things with far more resolution than the uncertainty principle would allow. Specifically, we're able to judge frequencies with far more accuracy than a fuzzy spectrogram would suggest.

From what I understand, our brain essentially performs a kind of "deconvolution" on the fuzzy frequency data to identify a far "sharper" and defined frequency, which is relatively straightforward since the frequency "spread" is a known quantity.

This works well most of the time because we correctly assume we're dealing with relatively isolated sound sources emanating a distinct fundamental with a distinct series of overtones.

Our perception can become inaccurate when that assumption fails to hold: sounds merge or become indistinguishable, we hear beat tones that don't technically exist, our brain gives up trying to hear frequencies and classifies it all as noise, etc.

I've never come across audio spectrogram software that attempted to perform a frequency deconvolution in a way that roughly simulates what our own ears do, but I'd love to know if anyone else has and could point me to it.

Our brain and ears perceive sound not as separate frequencies but as bands of frequencies. Humans are not able to differentiate frequencies that are very close to each other because of that. We can identify separate musical notes because they fit into separate bands.

Amir from AudioScienceReview did a good introductory video about psychoacoustics as well as frequency response in general: https://www.youtube.com/watch?v=TwGd0aMn1wE

Another part is that the human hearing is probably closer to Empirical Mode Decomposition (EMD) than a Fourier variant.

With EMD, a phantom "beat frequency" would actually show up in the transform space.

The purely algorithmic way to do it is the Wigner-Ville distribution, but it isn't practical for complex sounds due to the quadratic explosion of interactions between all time-frequency components. For a small number of well-separated 'chirp' signals it can give you exact localization.

I think the software you are looking for would have to be based on a machine-learning rather than a purely theory-based approach, if it's intended for use with natural sound signals.

I’m pretty sure I heard somewhere that the ear does autocorrelation rather than a Fourier transform, but I’m not sure how correct that is.

EDIT: Scene_Cast2’s comment below says it’s Empirical Mode Decomposition, not autocorrelation.

Anyway, I see no reason why spectrograms have to be fuzzy… a wide window size can locate frequencies very precisely while smoothing out fast variations in amplitude, which sounds pretty similar to how we hear things.

(Interestingly, when analysing the voice, linguists tend to use the opposite: a narrow window size, which smears out frequencies making the resonance bands more obvious, while allowing visualisation of fast glottal vibrations.)

Search for voice coding.

That's how many voice coding algorithms work, you try to find a digital filter that generates a sound that is as close as the original according to a perception based metric, then transmit the filter coefficients.

I don't remember the exact details, but if I'm not mistaken generating this sort of metric is really time consuming.

Thanks! That's definitely along the same lines, although that's for the special case of only a single fundamental frequency (with overtones). And I'm not sure it uses deconvolution -- I've heard of maximum likelihood estimation being used, in order to get a single frequency, though the ideas are closely related.

Using deconvolution would be more for the purposes of cleaning up a general-purpose spectrogram for human eyes -- for analysis and for sound editing, whether a single voice or a band of several instruments.

I did experiment with CWT in the past [1] and was disappointed, to be honest. Not only is it grossly slow and complicated, it hardly gives more fidelity than a plain FFT. It has the "window problem" which makes the low freqs too blurred and the high freqs too sharp, and it has the "wrapping ends" problem: the input (about 1 million samples at least) needs sufficient zero padding on both ends, as otherwise the two ends will interfere with each other.

Below is my GPU-based CWT that's 50x slower than the JS-only version in the post above.

[1] https://soundshader.github.io/?s=cwt

I've been wondering about the apparent contradiction between the limitations of spectrograms and the remarkable fidelity of MP3 files, which I thought operated along similar lines.

When you convert a spectrogram back into sound it sounds like crap, but then how does MP3 store the frequency information (and why can't we use that for visualizations)?

The math is beyond my understanding, can anyone give some kind of analogy maybe?

> When you convert a spectrogram back into sound it sounds like crap

FFT gives you the spectrum + the phase. If you only use the spectrum to resynthesise, you're missing half the information. Temporal domain <-> spectral domain is a 99.9999999% lossless (not 100%, I believe, because of floating-point shenanigans, but enough to not matter at all) transform in both directions.

I think the trouble you're running into is that a spectrogram discards phase information so it's not informationally complete, and impossible to perfectly invert. Basically, a Fourier Transform represents a sound as a series of many sound waves at different frequencies added together. In order to make a pretty picture, the phase is thrown away, and only the magnitude of each wave is shown. The trouble is, to go back to a pleasant/accurate sound, we need that phase information that is missing.
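This is easy to demonstrate on a single FFT frame: inverting the full complex spectrum recovers the waveform exactly, while inverting the magnitudes alone (phase discarded) does not. A minimal numpy sketch:

```python
# One FFT frame: the full complex spectrum inverts exactly; the
# magnitude-only version (i.e. "the pretty picture") does not.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)            # arbitrary test frame

X = np.fft.rfft(x)
with_phase = np.fft.irfft(X)             # magnitude AND phase kept
mag_only = np.fft.irfft(np.abs(X))       # phase thrown away

print(np.allclose(with_phase, x))        # True: perfectly invertible
print(np.allclose(mag_only, x))          # False: information is gone
```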

I thought this was the case too, until I stumbled upon a stackoverflow question explaining how to recover the phase data from overlapping FFT frames. The key word here is "overlapping".

I implemented a simple clone of mp3 and it was not that hard. If you do a discrete Fourier transform of the audio (in small overlapping windows), quantize the resulting coefficients, and compress them losslessly using the Huffman codes, you will end up with something not that far from mp3. The human ear is quite forgiving to the effects of quantization in frequency domain.
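A rough sketch of that "windowed DFT + quantize" core, leaving out the MDCT, the psychoacoustic model and the Huffman coding that real MP3 uses. The frame size and quantizer step here are arbitrary choices:

```python
# Toy transform codec: overlapping windowed DFTs, coarse quantization of
# the coefficients, overlap-add resynthesis. Not MP3, just its skeleton.
import numpy as np

def encode_decode(x, frame=1024, step=512):
    win = np.hanning(frame)
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(0, len(x) - frame + 1, step):
        spec = np.fft.rfft(x[i:i + frame] * win)
        q = 1e-2 * np.max(np.abs(spec))     # crude uniform quantizer step
        spec = (np.round(spec.real / q) + 1j * np.round(spec.imag / q)) * q
        y[i:i + frame] += np.fft.irfft(spec) * win
        norm[i:i + frame] += win**2
    return y / np.maximum(norm, 1e-12)      # overlap-add normalization

sr = 48000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)             # 1 s test tone
y = encode_decode(x)
mid = slice(2048, -2048)                     # ignore the un-overlapped edges
snr = 10 * np.log10(np.mean(x[mid]**2) / np.mean((x[mid] - y[mid])**2))
print(f"SNR after quantization: {snr:.1f} dB")
```

With a real codec the quantizer step would vary per band, driven by masking thresholds from the psychoacoustic model, and the rounded integers would then be Huffman-coded.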

MP3 does not have remarkable fidelity though. MP3, and my clone of it, suffers from time domain artifacts. Quantization in the frequency domain causes distortion in the time domain as well, negatively affecting high frequency transient sounds like cymbals. That is more noticeable. Newer generation codecs like AAC handle transients much better, but they are considerably more advanced, and often use different transforms like wavelet transform.

The general concepts are described here: https://en.m.wikipedia.org/wiki/Psychoacoustics

I'm not sure what you mean by converting the spectrogram to sound, but my guess is that the windowing done on the short-time Fourier transform is causing artifacts.

My hypothesis: it is stored magnetically (after all magnetic sinusoidals exist) and converted electrically once the mp3 is activated in time.

I’m answering further: it is a result of the software programs limitations to reconstruct the mathematically perfect electromagnetic waveform quantized beforehand which is causing noise. No one writes perfect code.

That's a funny thread. Plato distinguishes the world of ideas from the world of matter. The latter is all about magnetic fields, while the former has none of them, as it's the world of archetypes, hilbert spaces and fourier transforms. So mp3 belongs to the world of ideas, it's a concept that expresses a projection of an abstract waveform to a specific basis of functions. Sound represents a particular embodiment (aka a mechanical wave) of an abstract waveform in the world of matter. The fact that this mechanical wave is driven by magnetic fields is irrelevant and is also a tautology, because in the world of matter everything is driven by magnetic fields.

What do you suppose is frequency content or information? It must principally exist in an analog continuous-time domain before being quantized and cached, correct? So how do these quantities, when processed, not correspond to what is ultimately electromagnetic?

I keep reading by commenters that a purely mathematical space which operates on physical movement has no relevance to physical movement vis-a-vis spectral analysis and particularly in how this analyzer - again embedded with recursive electromagnetic charges (software code) - does not affect the original time-sample. I am aware now that MP3 encoding is capturing more information than the analyzer - by design - but how can electromagnetic resonances not be considered in the discussion to warrant continual downvotes?

(I'd like to remind readers that downvotes are not made available for all users, but only those who meet a certain criteria, i.e. those which reflect the general Hacker News community.)

> but how can electromagnetic resonances not be considered in the discussion to warrant continual downvotes

Because everyone's been telling you the same thing in a multitude of different ways, but you refuse to get the point. Yes, the operation of the software we write can be affected by the physical realities of the hardware we use to execute them, but any time this causes the software to behave differently than it would in a purely hypothetical computer, this is considered an error and the results invalid. We even have hardware that automatically corrects for when rare events such as freakin' cosmic rays cause the value of a bit to flip (ECC memory).

The domain of software engineering is abstracted from physical reality. There's nothing useful to be gained from such discussions, because the whole job of hardware engineers is to enable us to operate at a higher level where we don't have to concern ourselves with the quantum electrodynamics necessary to make the transistors do their jobs.

Thank you for the response. What evidence do you use to support your judgment that the physical system is not affecting the mathematical transformations in the present example? Or is it just an assumption that you do not need to consider “quantum electrodynamic” parameters even though the motion involved is intrinsically natural, eg bird audio?

Can someone please explain to me why a hypothesis of magnetic frequencies contained in an electromagnetic cache (i.e. MP3) are not plausible to be transformable in real electrical output as opposed to downvoting me? It would be much more productive dialogue.

Your comment doesn't make sense, it sounds like new-age nonsense. I don't even understand what you're trying to say.

What is a magnetic frequency? What does it mean to be "activated in time"? How does any of that relate to the question?

There is no need for a "hypothesis" here, mp3 is not a mysterious physical phenomenon, it is a well-defined file format, backed by well-understood signal processing.

Sir, it relates to how does the information in the mp3 file maintain its fidelity whereas a spectrogram transformation of the original analog sampling becomes corrupted. Are you familiar with digital signal processing and sampling theory? Are you familiar with it to the most fundamentally electromagnetic level of understanding? Are you familiar with how computer memory is electromagnetically cached?

I’m quite alarmed that I am being virtually thrown under the bus by those who do not have the knowledge on how they program electromagnetic waveforms.

Sorry, but it's patently obvious that you are the one who isn't familiar with how any of this works. A spectrogram isn't corrupted, it's just a representation that chooses to not record some information - and thus it's not surprising that the information it excludes isn't recoverable. Whereas an MP3 is designed to include this information. That's all that is to it. Anything "electromagnetic" is entirely irrelevant at that level.

Well thank you for educating me on the spectrogram. But is not the spectrogram a digital system, ie software, output? Are you simply saying the program itself does not find it resourceful to reconstruct the analog time-series with complete fidelity? Why wouldn’t it? In theory it should be able to, know?

I had been merely supposing the spectrogram software programming causes such spurious frequencies rather than actual filtering of what is still fundamentally electromagnetic action.

This seems a little word-salad-y ?

MP3s are stored in bitstrings. It doesn't matter what the medium these bitstrings are stored in.

The question being asked is a question about information, not about physics. So, your response is inapplicable.

Yes, but what are bitstrings but elementally electrical action? What do you think information is fundamentally if it is not contained in a physical paradigm?

You could write the bitstring of MP3 on a piece of paper with a pen and it wouldn't change a thing.

Sir, surely you don’t imply that a computer can process that.

What he's implying, and what you're missing here, is that the discussion is about the FFT and MP3 algorithms, and how that signal processing affects the signal.

The physical substrate which executes the algorithm is irrelevant to the discussion. Indeed, even the fact that the signal represents sound waves is irrelevant. We could build a computer that executes the same instructions on the same inputs mechanically, using gears and valves and whatnot, or perform the instructions manually on paper, and the algorithm would result in exactly the same output.

Why don't you believe the principle of electromagnetic action effecting the algorithmic process vis-a-vis the mathematical transformation of the quantities of motion is irrelevant?

You’ve got a double negation there. (I.e. you said “don’t” and “irrelevant”, when presumably what you meant to express would either have the “don’t” and “relevant” rather than “irrelevant”, or would lack the “don’t”.)

You are wording things very strangely (E.g. “principle of electromagnetic action”, “quantities of motion”). This greatly hinders communication. Are you able to say things in a more normal way?

My guess is you are asking why they believe that the substrate the algorithm is implemented in is irrelevant. This is by nature of what it means to implement an algorithm. The same algorithm implemented faithfully (by which I mean, implemented such that it runs correctly, i.e. as specified) will behave the same regardless of the substrate, because what it means for the algorithm to run correctly is independent of the substrate. If it behaved differently in a different substrate, in that it gave a different output, then it would not be performing/implementing the same algorithm, by virtue of what it means to be implementing an algorithm.

Sorry for the double negation, thank you for interpreting it. I am striving to speak in a natural scientific language where the wording is very precise. The concepts you give as examples are commonly found in the minds of great thinkers such as Maxwell and de Broglie. So to speak more commonly is frankly less than ideal.

The algorithm being considered must be considered in an applied scientific paradigm. One is not simply examining a mathematical operation, but one which is being examined to cause natural movement - unless you believe an algorithmic process is occurring with nothing correspondent to nature? To consider it irrelevant to the question of the fidelity differences in a spectrogram analog conversion and an MP3 - which I must remind reader entails the electrical signal output to something humanely resourceful, e.g. listening in headphones - is lacking in critical insight. This is after all the original question, and not an issue of algorithmic differences, correct?

Sir, I will assume you downvoted me as opposed to answering my query. Philosophical dialogue may not be intended for you, but for those who aim for eternity, it is important to arrive at necessary truths before we proceed to explain what an algorithm is in practice.

"Elementally electrical action" ain't no well defined concept I've ever heard of.

Sorry, I meant “elementary”, as in “fundamentally” or “irreducibly”.

Hello HN! Author here. I was thinking of calling the post "The underappreciated complexity of musical sounds" but decided to stick with the DFT one as it would probably get more attention. This is a small discovery I came across this weekend. FFT-based spectrograms of musical instruments aren't a novel thing to do, but I thought: what if I do a super highres spectrogram with a continuum of frequencies, instead of the N fixed ones FFT gives? Turns out, FFT "supports" such frequency shifting by multiplying the input by a specially constructed complex exponent. As a result, I've found that musical instruments produce sophisticated ornaments in between the harmonic levels.

Did I understand this correctly, what you are doing is essentially:

X[n] = F[x[k]][n/2] if (n even) else F[x'[k]][(n+1)/2]

With F[x[k]] the DFT of the time-domain signal x[k], x'[k] = x[k]·exp(2·pi·i·k·alpha) and this alpha some constant which yields a frequency-domain shift by 25Hz.

If so: How does this method compare to zero-padding the time-domain signal (i.e. sinc-interpolating the frequency domain)? It is an interesting concept, but alas it's not immediately clear to me how to analyze this...

This sounds about right. I assume your (n+1)/2 is really n+1/2. The idea, like you've said, is to get Y[k+1/2] values where Y = FFT[X].

Whether this is mathematically sound is another question. I presume that it is, for two reasons. First, FFT essentially convolves X with a bunch of sinusoids with frequencies from a fixed set: 0 Hz, 50 Hz, 100 Hz and so on. There's nothing wrong with manually convolving X with a 57.3 Hz sinusoid, it's just FFT isn't designed for this (it's designed for rapid computation). The other reason is that combining such shifted FFTs we get what looks almost exactly like a CWT (i.e. wavelet transform).

As for sinc-interpolation, I think it's mathematically equivalent. Say we shift the input X with Z[k] = exp(ik/N...) and get XZ. Then we transform it: FFT[XZ] = FFT[X] conv FFT[Z], so it's convolving FFT[X] with FFT[Z], where FFT[Z] is probably that sinc kernel. I certainly know from experiments that the FFT of exp(2·pi·i·k·alpha), where alpha doesn't precisely align with the 1024 grid, produces a fuzzy function with a max around alpha and a bell-shaped curve around it; the width of the curve depends on how precisely alpha fits into one of the 1024 grid points.
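The half-bin shift itself is easy to verify numerically: multiplying the input by the complex exponential makes a plain FFT sample the spectrum exactly at the fractional bins. A small sketch (sign conventions follow numpy's FFT):

```python
# DFT shift theorem: FFT(x[k] * exp(-2j*pi*k*alpha/N))[m] equals the
# DTFT of x evaluated at the fractional bin m + alpha.
import numpy as np

N = 1024
k = np.arange(N)
rng = np.random.default_rng(0)
x = rng.standard_normal(N)          # arbitrary real test signal

alpha = 0.5                         # shift by half a bin
shifted = np.fft.fft(x * np.exp(-2j * np.pi * k * alpha / N))

# Direct DTFT evaluation at the fractional bin m + 0.5
m = 300                             # any bin works
direct = np.sum(x * np.exp(-2j * np.pi * k * (m + alpha) / N))

print(np.abs(shifted[m] - direct))  # tiny: identical up to rounding
```

So interleaving the shifted and unshifted FFT outputs really does give the spectrum on a twice-finer frequency grid, at the cost of one extra FFT.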

Instead of combining 2 FFTs of 1024 bins (one shifted + one non-shifted), could you not just calculate 1 FFT of 2048 bins? Isn't it the same result?

Larger FFT window has undesired side effects because the estimated frequencies are averaged over the entire window. Moreover, the FFT output always spans from 0 Hz to 24 kHz (with 48 kHz sample rate), so to zoom into the 0..6 kHz region we'll need a window with 8192 bins or about 150 ms. With such window it would be impossible to capture rapid volume oscillations.

My mind is kind of blown that birdsong virtually does not include higher harmonics. I didn’t even think that was possible for a physical resonator. Great post

I think the mystery has a simple explanation: when a bird sings at 7 kHz and the mp3 file captures only the first 20 kHz, there isn't much room for harmonics. Maybe birds do have interesting harmonics at 56 kHz; we just don't know.

Maybe they were not captured by the bandwidth-limited microphone?

Thanks for writing this up. I'm always on the lookout for alternative methods for DFTs and the like, currently concentrating on interpolation of low frequencies (after DC, but still within the first 5% of wave numbers). I'll see if this fits my use case soon, hopefully today.

What is an ornament?

> A typical FFT-based spectrogram uses 1024 bins on a 48 kHz audio, with about 50 Hz step per pixel. Most of the interesting audio activity happens below 3 kHz, so 50 Hz per pixel gives only 60 pixels for that area.

That seems misleading. First of all, how often do you take a 1024 sample FFT? In theory, you could calculate it every sample, in which case you have 60 pixels, but 48,000 times per second.

Secondly, you can make use of frame-over-frame phase information. If you are looking at signals with mostly periodic content in that 3 kHz band, the phase information can indicate how much the signal in a given band deviates from that band's center frequency.

If the signal is dead on the frequency, then the phase component is stable frame-over-frame; the value does not move. If the signal is off, the phase angle shifts, kind of like a CRT television that is out of vertical sync. Each frame catches the signal in a different phase compared to the previous frame due to the frequency drift. The farther the signal is from the FFT band's frequency, the faster the phase angle rotates.

If you analyze the movement of phase of the same bin between successive frames, you can get a higher resolution estimate of the frequency than what you might think is possible from the 50 Hz resolution of that bin.

What you can't resolve is the situation when multiple independent signals clash into that same frequency bin. The assumption has to hold that the bin has caught one periodic signal.
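A minimal sketch of that phase trick, assuming a single tone per bin (the tone frequency and hop size below are arbitrary): the wrapped deviation of the frame-to-frame phase advance from the bin center's expected advance gives the sub-bin frequency correction.

```python
# Phase-vocoder style frequency refinement: the rotation rate of a bin's
# phase across overlapping frames pins the frequency far more precisely
# than the ~47 Hz bin spacing.
import numpy as np

sr, N, hop = 48000, 1024, 256
f_true = 1057.3                   # deliberately off any bin center
t = np.arange(N + hop) / sr
x = np.sin(2 * np.pi * f_true * t)
win = np.hanning(N)

X0 = np.fft.rfft(x[:N] * win)
X1 = np.fft.rfft(x[hop:hop + N] * win)

b = int(round(f_true * N / sr))   # nearest bin (bin width = 46.875 Hz)
expected = 2 * np.pi * b * hop / N           # phase advance if exactly on-bin
dphi = np.angle(X1[b]) - np.angle(X0[b]) - expected
dphi = (dphi + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
f_est = (b + dphi * N / (2 * np.pi * hop)) * sr / N

print(f_est)                      # ~1057.3, far finer than the bin width
```

The wrap limits how far off-bin the tone may be before the estimate becomes ambiguous (here, about ±2 bins for a quarter-window hop), which is another way of stating the single-signal-per-bin assumption.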

This looks cool! But really needs "before" and "after" comparison images -- lo-res vs hi-res.

Seeing the hi-res images only gives me no idea what kind of improvement this is showing...

@gbh444g Hope you could maybe add some lo-res versions :)

(Would also be cool to have audio clips next to each image as well, but that's less important.)

It had a bit of that for the bird songs.

I wonder how we can make assumptions about bird songs while not taking into account how birds perceive sound.

For humans it's easier: plenty of studies have been done in that regard, and there is even a separate scientific field for studying human sound perception - psychoacoustics. Humans perceive sound in bands (a band is a range of frequencies), not separate frequencies. And the size of the bands varies with frequency, so that in the voice range they are narrower than at, for example, high frequencies. The FFT fits very nicely into that picture, and codecs were designed with human perception in mind.

As for animals, I don't know of any studies in that regard. I would assume that the way of perception should be very similar to the one humans have, at least at the mechanical level. As for the sensitivity and the size of the bands, as well as the dynamic range - it's hard to say. I'd love to see some studies that dig into the details there, but it seems they're very hard to do. Animals don't give you direct feedback.

Some of my family and I have been enjoying playing with the BirdNET[0] app which seems to use the ideas presented here to identify birds from recordings, utilizing machine learning.

[0] https://play.google.com/store/apps/details?id=de.tu_chemnitz..., https://apps.apple.com/us/app/birdnet/id1541842885

The spectrograms on this site have a lot of spectral leakage. This can be improved a lot by applying a window function (Blackman, Hann, etc). It doesn't seem like the author does this.

Applying the Hann window function eliminates all the spectral leakage, but it also makes the image rather dull and precise, very similar to CWT. You've made me realise that the intricate patterns seen on violin spectrograms are the result of interference of the spectral leakage from the main harmonics. It doesn't mean the patterns are fake. It means the patterns emerge only when the input sound is transformed a certain way (FFT with the rectangular window).
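The leakage difference is easy to quantify with an off-bin sinusoid: the highest sidelobe outside the main lobe drops by roughly 25-30 dB when a Hann window is applied (the test frequency and the main-lobe half-width used below are arbitrary choices):

```python
# Spectral leakage: rectangular vs. Hann window on an off-bin sinusoid.
import numpy as np

N, sr = 1024, 48000
t = np.arange(N) / sr
x = np.sin(2 * np.pi * 1057.3 * t)       # frequency between bin centers

rect = np.abs(np.fft.rfft(x))            # rectangular window (i.e. none)
hann = np.abs(np.fft.rfft(x * np.hanning(N)))

def sidelobe_db(mag, halfwidth=4):
    # highest magnitude outside the main lobe, relative to the peak
    p = int(np.argmax(mag))
    mask = np.ones(len(mag), bool)
    mask[max(p - halfwidth, 0):p + halfwidth + 1] = False
    return 20 * np.log10(mag[mask].max() / mag[p])

print(f"rectangular: {sidelobe_db(rect):.1f} dB")   # roughly -20 dB
print(f"hann:        {sidelobe_db(hann):.1f} dB")   # much lower, ~-47 dB
```

Which matches the observation above: the rectangular window's slowly decaying sidelobes are exactly what interferes between harmonics to produce those patterns.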

> as if birds “draw” with sound something that’s flying backwards in time


Just a heads up, you have to click the images to see the full resolution version! I spent a good while confused about not being able to see the details mentioned in the images.

> Smoothness in the time direction is easier to achieve: the 1024 bins window can be advanced by arbitrarily small time steps.

It appears you're doing just that, but the time "width" is still readily apparent in many of the spectrograms, most obviously on the birdsong ones -- almost like a horizontal motion blur.

Would a deconvolution filter be able to meaningfully horizontally "deblur" the spectrograms? So the birdsongs didn't appear to be drawn with a wide-tip marker, but rather a ballpoint pen? So not just hi-res, but hi-focus.

Thank you for that, that is fascinating!

On that note, also checkout wavelets to generate spectrograms: https://en.wikipedia.org/wiki/Wavelet

I have some implementations here: https://github.com/Lichtso/CCWT https://github.com/Lichtso/WebSpectrogram

This is fantastic! About 5 years ago (just before this repo was made it seems) I was doing a ton of stuff with EEG analysis with python. Used CWTs a ton but it was slooow, even with lots of numpy tricks. This would have been super handy.

I also learned in that time that while you can extract a complex signal from a real one using the Hilbert transform, it's not quite the same, and I've always wondered if we could achieve better fidelity/encoding/compression by starting with quadrature signals. Never figured out quite why, since Shannon-Nyquist says you should be able to encode all information of a bandwidth f signal with 2f sample rate, but I suspect it has to do with the difference between ideal real number math and nonlinear, quantizing, imperfect ADCs.

Not sure how you'd actually get quadrature signals from sound waves or any wideband scalar signal (maybe record at far higher sampling rate to get more phase information, then downsample), but it's a fun thought experiment.

You can convert to quadrature by either sampling a signal at >= 2 * Nyquist and using the Hilbert transform, or using two ADCs 90 degrees out of phase.
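A sketch of the FFT route to the analytic (quadrature) signal: zero the negative frequencies and double the positive ones, which is what scipy.signal.hilbert does internally. The test tone is chosen to fit a whole number of cycles so edge effects don't muddy the check:

```python
# Analytic signal via FFT: keep DC (and Nyquist), double positive
# frequencies, zero negative ones; the imaginary part of the result is
# the Hilbert transform (the 90-degree-shifted quadrature component).
import numpy as np

def analytic(x):
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

sr = 48000
t = np.arange(4096) / sr
f0 = 468.75                     # exactly 40 cycles in 4096 samples
x = np.cos(2 * np.pi * f0 * t)
z = analytic(x)

print(np.allclose(z.imag, np.sin(2 * np.pi * f0 * t)))  # True: quadrature copy
print(np.allclose(np.abs(z), 1.0))                      # True: flat envelope
```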

When you run two ADCs 90 degrees out of phase that introduces another source of error, due to timing jitter. There's no reason to bother doing this for audio signals because modern ADCs are more than capable of accurately sampling at audio rates.

> Despite this CWT implementation runs on GPU and this “advanced” FFT runs on JS, CWT is about 50-100x slower.

Sounds like a really crappy implementation of CWT. Besides this, the mother wavelet used was not specified, so maybe the author doesn't really know much about CWT.

I love this and have been looking for a program that's like Photoshop for sound.

>looking for a program that's like Photoshop for sound.

Look at Izotope RX.

Especially the Spectral Repair module might be what you are imagining, but it has a lot of interesting tools. This is from an older version: https://www.youtube.com/watch?v=vNtxg28wx_M

There is Photosounder, which has excellent sound quality (all editing is done in the frequency domain and then converted back).

For free there is also Virtual ANS and, on the more experimental side, https://www.fsynth.com (the conversion is done raw using additive synthesis; phase information is lost, so sound quality is affected).

You can try interpreting images as spectrograms, but the result will be a cacophonic mess.

There’s a reason why nobody does this. (Other than avantgarde experimental composers maybe, but they are looking for cacophony)

Most often it yields a cacophonous mess, it's true - but it really depends how carefully the image is made. If the image was made from a fully-detailed FFT of a sound, then in theory you could get the exact same sound back out that you put in, I reckon! :) (Admittedly, negative color values would be required (or clever mapping), along with spreading time-in-the-sound over space-in-the-image.)

Words escaped me before: by "fully detailed", as explained in another thread above :), I meant that the input image must contain (or have extracted from it) phase information in addition to the amplitudes that a spectrograph-like extraction would provide, in order to have full control over the result (alternatively, a complex-valued image (usually positive and negative) would do it). It's hard to make phase info up from nothing, I reckon; maybe that's why many image-to-sound things sound harsh/strange?
