Working with a spectrogram is definitely similar to working with an image, and it's interesting to think about why that's the case.
Take convolutional models, for example. They're very effective for working with images because they (a) are parameter efficient, (b) learn local/spatial correlations in input features, and (c) exploit translational invariance. As an oversimplification, we can train models to visually identify "things" in images by their edges.
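To put rough numbers on the parameter-efficiency point, here's a quick PyTorch comparison (the 128x128 input size and layer widths are just arbitrary examples, not anything from a real model):

```python
import torch.nn as nn

# A 3x3 conv layer reuses the same small kernel everywhere in the image,
# while a fully connected layer needs a weight per input pixel per output unit.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
dense = nn.Linear(in_features=128 * 128, out_features=16)

print(sum(p.numel() for p in conv.parameters()))   # 160    (16 * 1 * 3 * 3 weights + 16 biases)
print(sum(p.numel() for p in dense.parameters()))  # 262160 (16 * 16384 weights + 16 biases)
```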
If you think about what's going on with an audio spectrogram, you can see the same concepts at work. There's local/spatial correlation - certain sounds tend to have similar power coefficients in similar frequency buckets. These are also correlated in time (because the pitch envelope of the word "yes" tends to have the same shape), and convolutional models can also exploit time-invariance (in the sense that they can learn the word "yes" from samples where it appears with varying amounts of silence to the left and right).
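For what it's worth, here's a minimal sketch of that "spectrogram as image" view using torchaudio. The sample rate and mel parameters are placeholder choices, and the waveform is random noise standing in for a real one-second clip:

```python
import torch
import torchaudio

# Turn a 1-second clip into a mel spectrogram and run it through a tiny conv
# layer, exactly as you would for a single-channel image.
sample_rate = 16000
waveform = torch.randn(1, sample_rate)  # stand-in for a real 1 s recording

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=64
)
spec = to_mel(waveform)  # shape (1, 64, ~101): (channels, freq bins, time frames)

conv = torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
features = conv(spec.unsqueeze(0))  # add a batch dim; freq/time play the role of H/W
print(spec.shape, features.shape)
```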
That being said, the addition of the time domain makes audio quite hard to work with, and it's (usually) not as simple as just running a spectrogram through a vanilla image classification model. But it's definitely enlightening to think about how these models are "learning".
Thanks for that note. I have an audio classification hobby project (for now). Could you point me to things I should learn to get better at audio classification and generation?
Your comment about the time domain making audio difficult - before doing some research, I thought it would make it impossible. But it looks like people have had some success using spectrograms of short audio samples. What techniques should I try to learn to deal with the time component of audio?
One idea is to chop up the audio into short samples and treat the resulting images as a video. Then look at DL algorithms that deal with video. Am I on the right track?
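Concretely, something like this is what I'm picturing (a rough sketch; the file path, window length, and hop are placeholders, and it assumes the recording is at least one window long):

```python
import torch
import torchaudio

# Slice a longer recording into overlapping 1-second windows and compute a
# mel spectrogram per window, stacking them like frames of a video.
waveform, sr = torchaudio.load("recording.wav")
waveform = waveform.mean(dim=0)  # mix down to mono
win, hop = sr, sr // 2           # 1 s windows, 50% overlap

to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=64)
frames = [
    to_mel(waveform[start:start + win])
    for start in range(0, len(waveform) - win + 1, hop)
]
clip = torch.stack(frames)  # (num_windows, n_mels, time): a "video" of spectrogram frames
print(clip.shape)
```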