Working with a spectrogram is definitely similar to working with an image, and it's interesting to think about why that's the case.
Take convolutional models, for example. They're very effective for working with images because they're (a) parameter efficient, (b) able to learn local/spatial correlations in input features, and (c) able to exploit translational invariance. As an oversimplification, we can train models to visually identify "things" in images by their edges.
If you think about what's going on with an audio spectrogram, you can see the same concepts at work. There's local/spatial correlation - certain sounds tend to have similar power coefficients in similar frequency buckets. These are also correlated in time (because the pitch envelope of the word "yes" tends to have the same shape), and convolutional models can also exploit time-invariance (in the sense that they can learn the word "yes" from samples where the word appears with varying amounts of silence to the left and right).
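To make that concrete, here's a minimal sketch of the "spectrogram as image" idea (torchaudio/PyTorch are just my tool choices, and the two-class "yes"/"no" setup is made up for illustration):

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

sample_rate = 16000
waveform = torch.randn(1, sample_rate)  # stand-in for a 1 s mono clip

# Frequency on one axis, time on the other -> a single-channel "image".
to_mel = nn.Sequential(
    T.MelSpectrogram(sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=64),
    T.AmplitudeToDB(),
)
spec = to_mel(waveform)            # shape: (1, 64 mel bins, ~101 time frames)

# A vanilla image-style CNN: small local filters pick up time/frequency
# patterns wherever they occur in the clip (the translation-invariance point).
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),              # e.g. "yes" vs "no"
)
logits = cnn(spec.unsqueeze(0))    # add batch dim -> (1, 2)
print(logits.shape)
```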
That being said, the addition of the time domain makes audio quite hard to work with, and it's (usually) not as simple as just running a spectrogram through a vanilla image classification model. But it's definitely enlightening to think about how these models are "learning".
Thanks for that note. I have an audio classification hobby project (for now). Could you point me to things I should learn to get better at audio classification and generation?
Your comment about the time domain making audio difficult - before doing some research I thought it would make it impossible. But it looks like people have had some success using spectrograms of short audio samples. What techniques should I try to learn to deal with the time component of audio?
One idea is to chop up the audio into short samples and treat the resulting spectrogram images as frames of a video. Then look at DL algorithms that deal with video. Am I on the right track?
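Roughly what I'm imagining, as a hand-wavy sketch (the window length, overlap, and torchaudio usage are just placeholder choices on my part):

```python
import torch
import torchaudio.transforms as T

sample_rate = 16000
clip = torch.randn(1, 5 * sample_rate)           # stand-in for a 5 s recording

win_len = sample_rate                            # 1 s windows
hop = sample_rate // 2                           # 50% overlap
to_mel = T.MelSpectrogram(sample_rate=sample_rate, n_mels=64)

# Slice the clip into overlapping windows and turn each into a small
# spectrogram "frame", like frames of a video.
frames = []
for start in range(0, clip.shape[1] - win_len + 1, hop):
    chunk = clip[:, start:start + win_len]
    frames.append(to_mel(chunk))                 # each: (1, 64, time)

video_like = torch.stack(frames, dim=0)          # (num_frames, 1, 64, time)
print(video_like.shape)                          # e.g. torch.Size([9, 1, 64, 81])
```

The idea would then be to feed that stack into the kind of model people use for video (per-frame CNN plus something that aggregates over frames).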
I think a major reason for this is transfer learning. For computer vision, there are many good pretrained models that were trained on huge datasets (like ImageNet) and can be fine-tuned for custom tasks. Other fields often do not have such pretrained models and huge datasets to work with, so it turns out that transforming a dataset into an image dataset and fine-tuning a pretrained model often works better than training from scratch.
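As a rough illustration of what that fine-tuning setup can look like (the resnet18 choice, the class count, and the channel-repeat trick are all just assumptions for the sketch, not a recommendation):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # hypothetical number of audio classes

# Start from ImageNet weights (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Optionally freeze the backbone to begin with...
for p in model.parameters():
    p.requires_grad = False

# ...and swap in a fresh, trainable classification head for the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Single-channel log-mel spectrograms, repeated to 3 channels so the
# pretrained RGB stem accepts them (one common workaround, not the only one).
spec = torch.randn(8, 1, 64, 101)
logits = model(spec.repeat(1, 3, 1, 1))            # -> (8, num_classes)
```

Typically you'd train just the new head first, then unfreeze and fine-tune the whole thing with a small learning rate.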
Even DL approaches to audio processing (classification, separation, etc.) often seem to convert the audio to spectrograms and apply DL to those.
Reformulating a problem so that its inputs are expressed as images seems to be an advantage when using DL as a solution. Would you agree?