I've always found it interesting that while that's fundamentally true in terms of information, my understanding is that we perceive things with far more resolution than the uncertainty principle would allow. Specifically, we're able to judge frequencies with far more accuracy than a fuzzy spectrogram would suggest.
From what I understand, our brain essentially performs a kind of "deconvolution" on the fuzzy frequency data to identify a far "sharper" and defined frequency, which is relatively straightforward since the frequency "spread" is a known quantity.
This works well most of the time because we correctly assume we're dealing with relatively isolated sound sources emanating a distinct fundamental with a distinct series of overtones.
Our perception can become inaccurate when that assumption fails to hold: sounds merge or become indistinguishable, we hear beat tones that don't technically exist, our brain gives up trying to hear frequencies and classifies it all as noise, etc.
I've never come across audio spectrogram software that attempted to perform a frequency deconvolution in a way that roughly simulates what our own ears do, but I'd love to know if anyone else has and could point me to it.
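For what it's worth, here's a minimal numpy sketch of the kind of thing I have in mind, assuming a Wiener-style deconvolution along the frequency axis, with the window's own transform as the known smearing kernel (this only really holds for the single-tone case, where the magnitude spectrum is just a shifted copy of the window's transform):

    import numpy as np

    N = 1024
    win = np.hanning(N)
    tone = np.cos(2 * np.pi * 200.3 * np.arange(N) / N)  # tone between bin centers
    S = np.abs(np.fft.fft(tone * win))                   # smeared magnitude spectrum

    # The known "spread": the window's own magnitude spectrum
    K = np.abs(np.fft.fft(win))

    # Wiener deconvolution along the frequency axis; eps regularizes the division
    Fs, Fk = np.fft.fft(S), np.fft.fft(K)
    eps = 1e-3 * np.max(np.abs(Fk)) ** 2
    sharpened = np.real(np.fft.ifft(Fs * np.conj(Fk) / (np.abs(Fk) ** 2 + eps)))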
Amir from AudioScienceReview did a good introductory video about psychoacoustics as well as frequency response in general: https://www.youtube.com/watch?v=TwGd0aMn1wE
With EMD, a phantom "beat frequency" would actually show up in the transform space.
I think the software you are looking for would have to be based on a machine-learning rather than a purely theory-based approach if it's intended for use with natural sound signals.
EDIT: Scene_Cast2’s comment below says it’s Empirical Mode Decomposition, not autocorrelation.
Anyway, I see no reason why spectrograms have to be fuzzy… a wide window size can locate frequencies very precisely while smoothing out fast variations in amplitude, which sounds pretty similar to how we hear things.
(Interestingly, when analysing the voice, linguists tend to use the opposite: a narrow window size, which smears out frequencies making the resonance bands more obvious, while allowing visualisation of fast glottal vibrations.)
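Both settings are one parameter away in any STFT library. A quick scipy sketch (the 440 Hz tone and 16 kHz rate are just placeholders for a real recording):

    import numpy as np
    from scipy.signal import stft

    fs = 16_000
    x = np.cos(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in for a recording

    # Wide window: ~15.6 Hz bins, precise in frequency, smoothed in time
    f_w, t_w, S_wide = stft(x, fs=fs, nperseg=1024)

    # Narrow window: ~250 Hz bins, smeared in frequency, fine in time --
    # the "wideband" setting linguists use to see glottal pulses
    f_n, t_n, S_narrow = stft(x, fs=fs, nperseg=64)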
That's how many voice-coding algorithms work: you try to find a digital filter that generates a sound as close as possible to the original according to a perception-based metric, then transmit the filter coefficients.
I don't remember the exact details, but if I'm not mistaken generating this sort of metric is really time consuming.
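As a rough sketch of just the filter-fitting part (this is plain LPC via the autocorrelation method, not the expensive perceptual analysis-by-synthesis search that codecs like CELP do):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc(frame, order=10):
        # Fit an all-pole filter to one frame of speech by solving the
        # autocorrelation (Yule-Walker) normal equations
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        return a  # roughly what a codec would quantize and transmit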
Using deconvolution would be more for the purposes of cleaning up a general-purpose spectrogram for human eyes -- for analysis and for sound editing, whether a single voice or a band of several instruments.
Below is my GPU-based CWT that's 50x slower than the JS-only version in the post above.
When you convert a spectrogram back into sound it sounds like crap, but then how does MP3 store the frequency information (and why can't we use that for visualizations)?
The math is beyond my understanding, can anyone give some kind of analogy maybe?
fft gives you the spectrum + the phase. if you only use the spectrum to resynthesise you're missing half the information. temporal domain <-> spectral domain is a 99.9999999% lossless (not 100% I believe because of floating-point shenanigans, but enough to not matter at all) transform in both directions.
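easy to demonstrate with numpy:

    import numpy as np

    x = np.random.randn(4096)                 # arbitrary real signal
    x_rt = np.fft.ifft(np.fft.fft(x)).real    # forward + inverse transform
    print(np.max(np.abs(x - x_rt)))           # ~1e-16, i.e. float noise only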
MP3 does not have remarkable fidelity though. MP3, and my clone of it, suffer from time-domain artifacts. Quantization in the frequency domain causes distortion in the time domain as well, negatively affecting high-frequency transient sounds like cymbals. That is more noticeable. Newer-generation codecs like AAC handle transients much better, but they are considerably more advanced, using techniques such as switching to shorter transform windows around transients.
I'm not sure what you mean by converting the spectrogram to sound, but my guess is that the windowing done on the short-time Fourier transform is causing artifacts.
I keep reading from commenters that a purely mathematical space which operates on physical movement has no relevance to physical movement vis-a-vis spectral analysis, and particularly in how this analyzer - again embedded with recursive electromagnetic charges (software code) - does not affect the original time-sample. I am aware now that MP3 encoding is capturing more information than the analyzer - by design - but how can electromagnetic resonances not be considered in the discussion to warrant continual downvotes?
(I'd like to remind readers that downvotes are not made available for all users, but only those who meet a certain criteria, i.e. those which reflect the general Hacker News community.)
Because everyone's been telling you the same thing in a multitude of different ways, but you refuse to get the point. Yes, the operation of the software we write can be affected by the physical realities of the hardware we use to execute it, but any time this causes the software to behave differently than it would in a purely hypothetical computer, this is considered an error and the results invalid. We even have hardware that automatically corrects for when rare events such as freakin' cosmic rays cause the value of a bit to flip (ECC memory).
The domain of software engineering is abstracted from physical reality. There's nothing useful to be gained from such discussions, because the whole job of hardware engineers is to enable us to operate at a higher level where we don't have to concern ourselves with the quantum electrodynamics necessary to make the transistors do their jobs.
What is a magnetic frequency? What does it mean to be "activated in time"? How does any of that relate to the question?
There is no need for a "hypothesis" here, mp3 is not a mysterious physical phenomenon, it is a well-defined file format, backed by well-understood signal processing.
I’m quite alarmed that I am being virtually thrown under the bus by those who do not have the knowledge on how they program electromagnetic waveforms.
I had been merely supposing the spectrogram software programming causes such spurious frequencies rather than actual filtering of what is still fundamentally electromagnetic action.
MP3s are stored as bitstrings. It doesn't matter what medium those bitstrings are stored in.
The question being asked is a question about information, not about physics. So, your response is inapplicable.
The physical substrate which executes the algorithm is irrelevant to the discussion. Indeed, even the fact that the signal represents sound waves is irrelevant. We could build a computer that executes the same instructions on the same inputs mechanically, using gears and valves and whatnot, or perform the instructions manually on paper, and the algorithm would result in exactly the same output.
You are wording things very strangely (E.g. “principle of electromagnetic action”, “quantities of motion”). This greatly hinders communication. Are you able to say things in a more normal way?
My guess is you are asking why they believe that the substrate the algorithm is implemented in is irrelevant. This is by nature of what it means to implement an algorithm. The same algorithm implemented faithfully (by which I mean, implemented such that it runs correctly, i.e. as specified) will behave the same regardless of the substrate, because what it means for the algorithm to run correctly is independent of the substrate. If it behaved differently in a different substrate, in that it gave a different output, then it would not be performing/implementing the same algorithm, by virtue of what it means to be implementing an algorithm.
The algorithm being considered must be considered in an applied scientific paradigm. One is not simply examining a mathematical operation, but one which is being examined to cause natural movement - unless you believe an algorithmic process is occurring with nothing correspondent to nature? To consider it irrelevant to the question of the fidelity differences in a spectrogram analog conversion and an MP3 - which, I must remind the reader, entails the electrical signal output to something humanely resourceful, e.g. listening in headphones - is lacking in critical insight. This is after all the original question, and not an issue of algorithmic differences, correct?
X[n] = F[x[k]][n/2] if (n even) else F[x'[k]][(n+1)/2]
With F[x[k]] the DFT of the time-domain signal x[k], x'[k] = x[k]·exp(2·pi·i·k·alpha), and alpha some constant chosen to yield a frequency-domain shift of 25 Hz (half a bin).
If so: How does this method compare to zero-padding the time-domain signal (i.e. sinc-interpolating the frequency domain)?
It is an interesting concept, but alas it's not immediately clear to me how to analyze this...
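For concreteness, here's a numpy sketch of the interleaving as I understand it (the sign convention for the shift and the 51.2 kHz rate, chosen so a 1024-point FFT gives exactly 50 Hz bins, are my assumptions):

    import numpy as np

    N = 1024
    fs = 51_200                    # so the plain FFT has 50 Hz bin spacing
    k = np.arange(N)
    x = np.cos(2 * np.pi * 2375.0 * k / fs)   # test tone on the half-bin grid

    X0 = np.fft.fft(x)                               # bins at 0, 50, 100, ... Hz
    x_shift = x * np.exp(-2j * np.pi * k / (2 * N))  # modulate by half a bin
    X1 = np.fft.fft(x_shift)                         # bins at 25, 75, 125, ... Hz

    X = np.empty(2 * N, dtype=complex)               # interleave: 25 Hz spacing
    X[0::2] = X0
    X[1::2] = X1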
Whether this is mathematically sound is another question. I presume that it is, for two reasons. First, the FFT essentially correlates X with a bunch of sinusoids with frequencies from a fixed set: 0 Hz, 50 Hz, 100 Hz and so on. There's nothing wrong with manually correlating X with a 57.3 Hz sinusoid; it's just that the FFT isn't designed for this (it's designed for rapid computation). The other reason is that combining such shifted FFTs, we get what looks almost exactly like a CWT (i.e. a wavelet transform).
As for sinc-interpolation, I think it's mathematically equivalent. Say we shift the input X with Z[k] = exp(ik/N...) and get XZ. Then we transform it to FFT[XZ] = FFT[X] conv FFT[Z], so it's convolving FFT[X] with FFT[Z], where FFT[Z] is probably that sinc kernel. I certainly know from experiments that the FFT of exp(2·pi·i·k·alpha), where alpha doesn't precisely align with the 1024-point grid, produces a fuzzy function with a max around alpha and a bell-shaped curve around it; the width of the curve depends on how precisely alpha fits one of the 1024 grid points.
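A quick numpy experiment along those lines (the 8x padding factor and the rates are just my choices):

    import numpy as np

    N, fs = 1024, 48_000
    x = np.cos(2 * np.pi * 2357.0 * np.arange(N) / fs)  # off the 46.875 Hz grid

    X = np.abs(np.fft.rfft(x))               # coarse grid: peak between bins
    X_pad = np.abs(np.fft.rfft(x, n=8 * N))  # zero-padded: sinc-interpolated

    print(np.argmax(X) * fs / N)             # ~2343.75 Hz, nearest coarse bin
    print(np.argmax(X_pad) * fs / (8 * N))   # ~2355.5 Hz, much closer to 2357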
That seems misleading. First of all, how often do you take a 1024 sample FFT? In theory, you could calculate it every sample, in which case you have 60 pixels, but 48,000 times per second.
Secondly, you can make use of frame-over-frame phase information. If you are looking at signals with mostly periodic content in that 3 kHz band, the phase information can indicate how much the signal in a given band deviates from that band's center frequency.
If the signal is dead on the frequency, then the phase component is stable frame-over-frame; the value does not move. If the signal is off, the phase angle shifts, kind of like a CRT television that is out of vertical sync. Each frame catches the signal in a different phase compared to the previous frame due to the frequency drift. The farther the signal is from the FFT band's frequency, the faster the phase angle rotates.
If you analyze the movement of phase of the same bin between successive frames, you can get a higher resolution estimate of the frequency than what you might think is possible from the 50 Hz resolution of that bin.
What you can't resolve is the situation when multiple independent signals clash into that same frequency bin. The assumption has to hold that the bin has caught a single periodic signal.
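A sketch of that frame-over-frame phase trick (the sample rate, FFT size, and hop are just my assumptions):

    import numpy as np

    fs, N, hop = 48_000, 1024, 256
    f_true = 3015.0                      # deliberately off any bin center
    x = np.cos(2 * np.pi * f_true * np.arange(N + hop) / fs)

    w = np.hanning(N)
    X0 = np.fft.rfft(x[:N] * w)          # frame 1
    X1 = np.fft.rfft(x[hop:hop + N] * w) # frame 2, hop samples later

    b = np.argmax(np.abs(X0))            # loudest bin (~46.9 Hz wide)
    expected = 2 * np.pi * b * hop / N   # phase advance if dead on bin center
    dphi = np.angle(X1[b]) - np.angle(X0[b]) - expected
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)

    # Refined estimate: bin frequency plus the measured phase drift
    f_est = (b + dphi * N / (2 * np.pi * hop)) * fs / N
    print(f_est)                         # ~3015 Hz, far finer than one bin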
Seeing only the hi-res images gives me no idea what kind of improvement this is showing...
@gbh444g Hope you could maybe add some lo-res versions :)
(Would also be cool to have audio clips next to each image as well, but that's less important.)
For humans it's easier; plenty of studies have been done in that regard, and there is even a separate scientific field for studying human sound perception - psychoacoustics. Humans perceive sound in bands (a band is a range of frequencies), not separate frequencies. And the size of the bands varies with frequency, so that bands in the voice range are narrower than, for example, at high frequencies. The FFT fits very nicely into that picture, and codecs were designed with human perception in mind.
As for animals, I don't know of any studies in that regard. I would assume that the way of perception should be very similar to the human one, at least at the level of mechanics. As for sensitivity, the size of the bands, and dynamic range - it's hard to say. I'd love to see some studies that dig into the details there, but it seems they're very hard to do. Animals don't give you direct feedback.
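To make the human side concrete, one common approximation of the critical-band (Bark) scale is Zwicker's formula; a tiny numpy sketch:

    import numpy as np

    def hz_to_bark(f):
        # Zwicker & Terhardt's approximation of the Bark critical-band scale
        f = np.asarray(f, dtype=float)
        return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500) ** 2)

    # Bands widen with frequency: one Bark spans ~100 Hz near 500 Hz,
    # but several hundred Hz up around 8 kHz
    print(hz_to_bark([500, 1000, 4000, 8000]))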
 https://play.google.com/store/apps/details?id=de.tu_chemnitz..., https://apps.apple.com/us/app/birdnet/id1541842885
It appears you're doing just that, but the time "width" is still readily apparent in many of the spectrograms, most obviously on the birdsong ones -- almost like a horizontal motion blur.
Would a deconvolution filter be able to meaningfully horizontally "deblur" the spectrograms? So the birdsongs didn't appear to be drawn with a wide-tip marker, but rather a ballpoint pen? So not just hi-res, but hi-focus.
I have some implementations here:
I also learned in that time that while you can extract a complex signal from a real one using the Hilbert transform, it's not quite the same, and I've always wondered if we could achieve better fidelity/encoding/compression by starting with quadrature signals. Never figured out quite why, since Shannon-Nyquist says you should be able to encode all information of a bandwidth f signal with 2f sample rate, but I suspect it has to do with the difference between ideal real number math and nonlinear, quantizing, imperfect ADCs.
Not sure how you'd actually get quadrature signals from sound waves or any wideband scalar signal (maybe record at far higher sampling rate to get more phase information, then downsample), but it's a fun thought experiment.
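In case it's useful, scipy will give you the quadrature component of a real recording directly (this is the Hilbert-transform route I mentioned, not true quadrature sampling):

    import numpy as np
    from scipy.signal import hilbert

    fs = 48_000
    x = np.cos(2 * np.pi * 440 * np.arange(4096) / fs)  # real-valued input

    z = hilbert(x)        # analytic signal: x + i * HilbertTransform(x)
    envelope = np.abs(z)  # instantaneous amplitude
    inst_freq = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)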
When you run two ADCs 90 degrees out of phase, that introduces another source of error due to timing jitter. There's no reason to bother doing this for audio signals, because modern ADCs are more than capable of accurately sampling at audio rates.
Sounds like a really crappy implementation of CWT. Besides this, the mother wavelet used was not specified, so maybe the author doesn't really know much about CWT.
Look at Izotope RX.
Especially the Spectral Repair module might be what you are imagining, but it has a lot of interesting tools. This is from an older version: https://www.youtube.com/watch?v=vNtxg28wx_M
For free there is also Virtual ANS and https://www.fsynth.com on the more experimental side (conversion is done raw using additive synthesis, phase information is lost so sound quality is affected)
There’s a reason why nobody does this. (Other than avantgarde experimental composers maybe, but they are looking for cacophony)