I don't do neural nets, but if I had to crudely estimate...
10,000 songs (the outputs) * 16 layers * 16 parameters per node * 4 bytes per float ≈ 10 MB, plus a DB of song/artist names.
I'm probably underestimating the parameters per node, and overestimating the size of the layers closer to the input. Further, it's more likely structured as an LSTM than a convolutional network, since sound is a streaming source.
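
For anyone who wants to plug in their own guesses, here is the same back-of-envelope arithmetic as a small Python sketch. The function name and defaults are just my assumed figures from above (every layer as wide as the 10,000-song output, 16 parameters per node, 4-byte floats), not anything known about the real model.

```python
def estimate_model_bytes(num_outputs=10_000, num_layers=16,
                         params_per_node=16, bytes_per_float=4):
    """Crude size estimate: every layer is assumed to be as wide as the output layer."""
    return num_outputs * num_layers * params_per_node * bytes_per_float


size_bytes = estimate_model_bytes()
# Report in decimal megabytes (1 MB = 1,000,000 bytes).
print(f"{size_bytes:,} bytes, about {size_bytes / 1e6:.1f} MB")
```

Bumping `params_per_node` or shrinking the layer count changes the result linearly, so the estimate is only as good as those guesses.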