Converting it to a spectrogram was a nice step.

From the perspective of other source data, I wonder whether that limits you to five features (X, Y and the R, G, B values), or whether you could add fictional, non-human-visible "colours" as extra features and simply be unable to view them in the weight maps.

The article mentions that the spectrogram is really grayscale rather than having RGB channels.

An interesting idea. Yes, you can usually use a convnet with any number of channels per pixel.
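
To make that concrete, here is a rough sketch in plain Python (not from the article; `conv2d_single_filter`, `make_image` and `make_kernel` are names made up for illustration). It assumes the usual convolutional-layer setup, where each filter has the same depth as the input and sums over all channels, so the channel count is just a parameter: 1 for a grayscale spectrogram, 3 for RGB, or more for extra invented bands.

```python
# Illustrative sketch only: one convolution filter applied to an image with
# an arbitrary number of channels per pixel (no padding, stride 1).

def conv2d_single_filter(image, kernel):
    """image is [H][W][C], kernel is [kH][kW][C]; returns [H-kH+1][W-kW+1]."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k_h, k_w = len(kernel), len(kernel[0])
    if len(kernel[0][0]) != channels:
        raise ValueError("kernel depth must match the number of input channels")
    out = []
    for y in range(height - k_h + 1):
        row = []
        for x in range(width - k_w + 1):
            total = 0.0
            # Sum over the spatial window AND over every channel.
            for dy in range(k_h):
                for dx in range(k_w):
                    for c in range(channels):
                        total += image[y + dy][x + dx][c] * kernel[dy][dx][c]
            row.append(total)
        out.append(row)
    return out


def make_image(height, width, channels):
    # Dummy data: every channel value is 1.0.
    return [[[1.0] * channels for _ in range(width)] for _ in range(height)]


def make_kernel(size, channels):
    # Dummy weights: every weight is 1.0.
    return [[[1.0] * channels for _ in range(size)] for _ in range(size)]


if __name__ == "__main__":
    # 1 = grayscale, 3 = RGB, 7 = RGB plus four made-up extra bands.
    for channels in (1, 3, 7):
        out = conv2d_single_filter(make_image(4, 5, channels),
                                   make_kernel(3, channels))
        print(channels, "channels ->", len(out), "x", len(out[0]), "output map")
```

The output map has the same shape whatever the channel count; only the depth of the filter changes. The catch raised above still applies: a first-layer filter with more than three channels can no longer be displayed directly as a colour image.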

