
Using Deep Learning to Reconstruct High-Resolution Audio - yurisagalov
https://blog.insightdatascience.com/using-deep-learning-to-reconstruct-high-resolution-audio-29deee8b7ccd
======
eggoa
I hesitate to even post this, but I listened to the audio examples and it
seems like this project was not yet a success. I'm not trying to be a jerk or
snarky, but the reconstructed audio sounded terrible.

~~~
seandougall
I have to agree. There's certainly more high-frequency content, but it seems
mostly like noise, with only a vague amplitude correlation to the existing
audio.

I'd be curious to see if any better results could be obtained by applying a
similar technique in the frequency domain.

~~~
jhetherly
hey, author here

Thanks for the feedback.

"applying a similar technique in the frequency domain", "Maybe training an
image reconstructor on the short term spectrogram" - This is what I
originally thought to do. However, this approach suffers from information loss
whenever you transform from the frequency domain back to the time domain.
Since the goal was super-resolution in the time domain, working in the time
domain is more sensible.
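
(To make the loss concrete: if a model operates on the magnitude
spectrogram, the phase is discarded and has to be estimated on the way back
to the time domain, e.g. with Griffin-Lim. A minimal sketch, assuming
librosa is available:

    import librosa
    import numpy as np

    y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # phase discarded
    y_rec = librosa.griffinlim(S, hop_length=256)            # phase estimated
    n = min(len(y), len(y_rec))
    print("round-trip MSE:", np.mean((y[:n] - y_rec[:n]) ** 2))

The reconstruction error here comes from the missing phase, not from the
STFT itself.)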

~~~
tasty_freeze
Mathematically, the DFT is invertible, i.e. lossless, but practically there
will be a bit of loss due to the finite precision of floating-point numbers.
Even though it isn't lossless, the amount of loss should be minuscule compared
to the 16 kHz -> 2 kHz loss you are trying to overcome.
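
A quick sanity check with NumPy (a sketch; the exact error depends on
precision and signal length):

    import numpy as np

    x = np.random.randn(16000)                      # 1 s of audio-like samples
    x_rec = np.fft.irfft(np.fft.rfft(x), n=len(x))  # forward + inverse DFT
    print(np.max(np.abs(x - x_rec)))                # near machine epsilon

In double precision the round-trip error is on the order of 1e-15, many
orders of magnitude below anything audible.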

~~~
volkuleshov
The problem with the DFT is not whether it's lossless or not; it's that it may
not be the best feature representation for a given task.

Both the DFT and the proposed model apply convolutions to the input, but in
the former case these are fixed, while in the latter they are learned.

This is similar to how we don't use hard-coded features like SIFT, wavelets,
or Gabor filters when we do image classification with a CNN.
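
To make the analogy concrete, here is a minimal PyTorch sketch (an
illustration, not the model from the paper). The frozen layer implements the
real (cosine) part of a DFT basis as a convolution; the second layer has the
same shape but learns its filters from data:

    import torch
    import torch.nn as nn

    N = 256  # analysis window length

    # Fixed analysis: the real part of a DFT basis, frozen.
    t = torch.arange(N, dtype=torch.float32)
    k = torch.arange(N // 2 + 1, dtype=torch.float32).unsqueeze(1)
    basis = torch.cos(2 * torch.pi * k * t / N)          # (N/2+1, N)

    fixed = nn.Conv1d(1, N // 2 + 1, kernel_size=N, stride=N, bias=False)
    fixed.weight.data = basis.unsqueeze(1)
    fixed.weight.requires_grad = False

    # Learned analysis: same shape, but trained end to end.
    learned = nn.Conv1d(1, N // 2 + 1, kernel_size=N, stride=N, bias=False)

    x = torch.randn(1, 1, 4 * N)                         # raw audio batch
    print(fixed(x).shape, learned(x).shape)              # identical shapes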

------
volkuleshov
I'm one of the authors of the paper that proposes the deep learning model
implemented in the blog post, and I would recommend training on a different
dataset, such as VCTK (freely available, and what we used in our paper).

Super-resolution methods are very sensitive to the choice of training data.
They will overfit seemingly insignificant properties of the training set, such
as the type of low-pass filter you are using, or the acoustic conditions under
which the recordings were made (e.g. distance to the microphone when recording
a speaker).
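
As an illustration of the low-pass-filter sensitivity (a sketch with SciPy;
the filter choices are examples, not the ones from the paper):

    import numpy as np
    from scipy import signal

    sr, factor = 16000, 8                       # 16 kHz -> 2 kHz
    x = np.random.randn(sr)                     # stand-in for 1 s of speech

    # Two standard anti-aliasing choices yield measurably different inputs:
    lo_fir = signal.decimate(x, factor, ftype="fir")  # linear-phase FIR
    lo_iir = signal.decimate(x, factor, ftype="iir")  # Chebyshev I (default)
    print(np.mean((lo_fir - lo_iir) ** 2))      # nonzero difference

A model trained on one pipeline will see subtly mismatched inputs at test
time if the other is used.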

To capture all the variations present in the TED talks dataset, you would need
a very large model, probably trained for more than 10 epochs. The VCTK dataset
is better in this regard.

For comparison, here are our samples: kuleshov.github.io/audio-super-res/

I'm going to try to release the code over the weekend.

~~~
jhetherly
Thanks for commenting and the suggestion!

Indeed, the TED dataset has a lot of variability in terms of audio quality,
etc., which, as you mentioned, is difficult to capture with just 10 epochs of
training. I did try a larger network (up to 11 downsampling layers), but this
proved even more time-consuming to train (as expected). Thus, I split the
difference and went with a network similar to yours that was trainable over a
four-day period (10 epochs).

------
hackpert
I'm interested in seeing how computationally efficient this method turns out
to be and how well it generalizes to other audio data, and perhaps to other
signals as well. Going on a hunch about the model, I think there are more
efficient methods for bandwidth extension on audio samples that give
better-quality results, but it is great to see more deep learning people take
an interest in this domain. I do believe that deep learning can have a
tremendous impact on DSP and compression.

(Disclaimer: I developed a somewhat similar method applied to audio
compression earlier this year, yet to be published.)

~~~
starchild3001
Thanks for mentioning. Any links to your (or relevant) work?

------
crazygringo
While something like this is bound to fail for most music of any complexity
(e.g. a singing voice), I've often wondered if this would be highly successful
on, say, old solo piano recordings, where the possibilities of the instrument
are extremely well-defined and limited.

------
starchild3001
Thanks for sharing. The possibilities for this kind of technology are endless.
Maybe one day we'll start having crystal-clear conversations over the
telephone :)

------
bob1029
I am a little curious as to how this factors into fundamental information
theory.

In my mind, you are simply taking a 0-2 kHz signal and combining it with an
entirely different 0-8 kHz signal that is generated (arbitrarily, IMO) based
on the band-limited original data. I can see the argument for having a library
of samples as additional, common information (think of many compression
algorithms), but it is still going to be an approximation (lossy).

"The loss function used was the mean-squared error between the output waveform
and the original, high-resolution waveform." - This confuses me as a
performance metric when dealing with audio waveforms.

I think a good question might be: "What would be better criteria for
evaluating the Q (quality) of this system?"

THD between original and output, averaged over the duration of the waveforms?
Subjective evaluations (w/ man-in-the-middle training)? etc.
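
Two objective metrics that appear in the audio super-resolution literature
are signal-to-noise ratio and log-spectral distance. A minimal NumPy/SciPy
sketch (an illustration, not the blog post's evaluation code):

    import numpy as np
    from scipy import signal

    def snr_db(ref, est):
        # Signal-to-noise ratio in dB; higher is better.
        return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

    def lsd(ref, est, nperseg=2048):
        # Log-spectral distance: RMS difference of per-frame log power
        # spectra, averaged over frames; lower is better.
        # Assumes ref and est have equal length.
        _, _, R = signal.stft(ref, nperseg=nperseg)
        _, _, E = signal.stft(est, nperseg=nperseg)
        lr = np.log10(np.abs(R)**2 + 1e-12)
        le = np.log10(np.abs(E)**2 + 1e-12)
        return np.mean(np.sqrt(np.mean((lr - le)**2, axis=0)))

LSD penalizes exactly the kind of noisy high-frequency content that MSE
barely notices, and subjective listening tests (e.g. MOS) cover what no
waveform metric captures.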

------
cnxhk
The title is good. The performance is limited, and the number of examples is
not enough to draw any useful conclusion.

