For a perfect solution, it would simply be a matter of subtracting one waveform from the other. In practice, I suspect both the isolated voice and the isolated music would have significant "noise" left over. This would likely be more noticeable in the voice, since we are more sensitive to odd-sounding vocals.
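
To make the subtraction idea concrete, here is a minimal Python sketch. It rests on several assumptions of mine: the mix is a plain sample-by-sample sum of the two parts, the signals are time-aligned and equal in length, and two sine tones stand in for real voice and music. The `subtract` helper and the noise level are made up for illustration. It shows that any error in the voice estimate ends up, sign-flipped, in the music you get by subtracting.

```python
import math
import random

def subtract(mix, stem):
    """Subtract `stem` from `mix` sample by sample (both must be time-aligned)."""
    if len(mix) != len(stem):
        raise ValueError("signals must have the same length")
    return [m - s for m, s in zip(mix, stem)]

# Synthetic stand-ins for real audio (assumption): two sine tones, one second at 8 kHz.
SAMPLE_RATE = 8000
t = [n / SAMPLE_RATE for n in range(SAMPLE_RATE)]
voice = [0.5 * math.sin(2 * math.pi * 220 * x) for x in t]
music = [0.3 * math.sin(2 * math.pi * 440 * x) for x in t]
mix = [v + m for v, m in zip(voice, music)]

# Ideal case: a perfect voice estimate leaves exactly the music.
music_ideal = subtract(mix, voice)

# Realistic case: the voice estimate carries some error.
# Small random noise is used here purely as a placeholder for that error.
rng = random.Random(0)
voice_estimate = [v + rng.gauss(0, 0.05) for v in voice]
music_estimate = subtract(mix, voice_estimate)

# Compare each estimate against the true signal it is meant to recover.
voice_error = [e - v for e, v in zip(voice_estimate, voice)]
music_error = [e - m for e, m in zip(music_estimate, music)]
```

Here `music_error` is just `voice_error` with the sign flipped, so subtraction cannot do better than the estimate it starts from. Real separation errors are unlikely to look like simple random noise, so treat this only as an illustration of the arithmetic.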

