
Audio pre-processing for Machine Learning: Getting things right - scarecrow112
https://github.com/scarecrow1123/blog/issues/9
======
jacquesm
This is a good starting point but it ends just when things get interesting. If
you are going to process audio for ML, make sure you experiment with
normalizing the input volume, which can make a huge difference. And if your
inputs are in stereo, try processing them both as mono (mixed down to a single
channel) and as stereo to see which performs better.

Finally, if you pre-process the audio using an FFT try different FFT sizes.
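For the volume point, a minimal sketch of the two common normalization schemes (peak and RMS) in plain NumPy; the target levels here are arbitrary illustrative choices, not recommendations from the post:

```python
import numpy as np

def peak_normalize(x, target=0.95):
    """Scale so the largest absolute sample hits `target`."""
    peak = np.max(np.abs(x))
    return x if peak == 0 else x * (target / peak)

def rms_normalize(x, target_db=-20.0):
    """Scale so the RMS level lands at `target_db` dBFS."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0:
        return x
    target = 10 ** (target_db / 20)   # -20 dBFS -> 0.1 linear
    return x * (target / rms)

# A quiet test tone: 1 s of a sine at amplitude 0.1.
x = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
y = rms_normalize(x)
```

RMS normalization tends to matter more than peak normalization for ML features, since it equalizes perceived loudness rather than a single outlier sample.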

~~~
qmmmur
It might be good to understand how changing the hop and window size affects the
analysis so you're not blindly changing settings.

The trade-off for window size is between frequency resolution and time
resolution. A bigger window gives you narrower bands, so more frequency
resolution, while giving you less temporal resolution where the onset of a
transient is significant in the analysis. Similarly, hop size determines how
'leaky' the process is and how fine-grained the windows are. This can affect
detecting quick peaks or changes while possibly smearing them across a few
windows.
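To make the trade-off concrete, a quick sketch (NumPy and SciPy assumed; the sizes are just examples) comparing two window sizes on the same signal. The larger window yields finer frequency bins but fewer analysis frames:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)   # 1 s of noise as a stand-in

shapes = {}
for nperseg in (256, 1024):
    # Frequency spacing is fs / nperseg; hop here is nperseg // 2.
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    shapes[nperseg] = (len(f), len(t))
    print(f"window={nperseg:4d}: {len(f):3d} freq bins, "
          f"{f[1]:.1f} Hz apart, {len(t)} time frames")
```

The 256-sample window resolves events roughly 16 ms apart but only in 62.5 Hz bands; the 1024-sample window resolves 15.6 Hz bands but smears anything shorter than 64 ms.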

------
Hydraulix989
Careful! FFMPEG has an infectious license and the authors WILL publicly
humiliate you on their Hall of Shame if you get caught misusing it by not open
sourcing your whole application:

[https://ffmpeg.org/shame.html](https://ffmpeg.org/shame.html)

------
jkadlec
Some good basic info, but at the same time there are some inaccuracies. WAV is
not a lossless format; it's a container, and it can hold any compressed audio
format, even MP3. You can have PCM inside WAV, which is indeed lossless, but
you're not going to see that in the wild too often. Going with 16k is also
questionable, since most readily available pre-existing datasets were
recorded at 8k (which is what telephony codecs mostly use).

~~~
qmmmur
WAV is almost always lossless with PCM data; I'm not sure where you got the
impression that "you don't see that in the wild too often". Depending on what
kind of analysis you need to do, having your audio at 8k is going to render
any results useless. I would use 16k at a minimum and aim for 44.1k in order
to preserve the top end, which is where a large quantity of useful information
is. The reason most sets are recorded at 8 kHz is that they are run through
MFCCs, which are quite stubborn and insensitive to the high end anyway, with
most of the information useful for machine learning sitting in the bottom end.
If you're doing music or environmental sounds, you really need to preserve the
other frequency bands.
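Whichever rate you settle on, resampling mismatched sources to it is cheap. A sketch using SciPy's polyphase resampler (the rates and the test tone are just illustrative):

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
g = np.gcd(sr_in, sr_out)          # 44100:16000 reduces to 441:160
x = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)  # 1 s, 440 Hz

# Upsample by 160, filter, downsample by 441; the anti-aliasing
# filter is applied internally by resample_poly.
y = resample_poly(x, sr_out // g, sr_in // g)
```

`resample_poly` is generally preferred over FFT-based `resample` for audio: it handles non-periodic signals cleanly and the integer up/down ratio keeps the output length exact.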

------
tsomctl
I spent some time using the synchrosqueezing transform to preprocess audio
files before feeding them into the network. Basically, you do a CWT and then
feed its output into the synchrosqueezing step. It sharpens up the CWT so that
a signal isn't spread out across so many bins. The output of the CWT is
complex, and normally you throw the imaginary data away; the synchrosqueezing
transform uses the imaginary data to work its magic. Figures 5 and 6 of the
PDF below are good examples.

I believe I based my code off this MATLAB code:
[https://github.com/ebrevdo/synchrosqueezing/tree/master/sync...](https://github.com/ebrevdo/synchrosqueezing/tree/master/synchrosqueezing)

The above MATLAB code is ridiculously slow; I rewrote it using SSE intrinsics
and got it several orders of magnitude faster.

I hope this helps out someone. I never really produced anything with it, but I
still feel it is promising.

[https://services.math.duke.edu/~jianfeng/paper/synsquez.pdf](https://services.math.duke.edu/~jianfeng/paper/synsquez.pdf)
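The reassignment idea can be sketched without MATLAB. Below is a toy STFT-based synchrosqueezing in plain NumPy (the comment and the paper use the CWT variant, but the principle is the same): estimate each bin's instantaneous frequency from the complex phase, then move its energy to the bin nearest that frequency. All names and parameters here are my own illustration, not the paper's algorithm:

```python
import numpy as np

fs = 1000                                    # sample rate, Hz
x = np.sin(2 * np.pi * 100.0 * np.arange(2 * fs) / fs)  # pure 100 Hz tone

N, hop = 128, 32
win = np.hanning(N)
dwin = np.gradient(win) * fs                 # time derivative of the window

# Two STFTs over the same frames: one with the window, one with its derivative.
frames = np.lib.stride_tricks.sliding_window_view(x, N)[::hop]
Sw = np.fft.rfft(frames * win, axis=1).T     # shape (freq, time)
Sdw = np.fft.rfft(frames * dwin, axis=1).T
f = np.fft.rfftfreq(N, 1.0 / fs)

# Instantaneous frequency per bin: f - Im(S_dw / S_w) / (2*pi).
# This is where the complex (phase) part does its work.
mag = np.abs(Sw)
valid = mag > 1e-3 * mag.max()               # skip near-empty bins
inst_f = f[:, None] - np.imag(Sdw / np.where(valid, Sw, 1.0)) / (2 * np.pi)

# "Squeeze": reassign each bin's energy to the bin nearest its inst. frequency.
squeezed = np.zeros_like(mag)
rows = np.clip(np.round(inst_f / f[1]).astype(int), 0, len(f) - 1)
for j in range(mag.shape[1]):
    v = valid[:, j]
    np.add.at(squeezed[:, j], rows[v, j], mag[v, j])

peak_hz = f[np.argmax(squeezed.sum(axis=1))]  # energy piles up near 100 Hz
```

The raw STFT spreads the tone across the window's main lobe (several bins); after reassignment nearly all of the energy lands in a single bin, which is the "sharpening" the parent describes.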

------
a-dub
Use this:
[https://github.com/librosa/librosa](https://github.com/librosa/librosa)

