
Audio AI: isolating vocals from stereo music using Convolutional Neural Networks - turbohz
https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785
======
btown
This is an awesome project, but it seems it was done without reference to the
academic literature on source separation. In fact, people have been doing
audio source separation with neural networks for years.

For instance, Eric Humphrey of Spotify's Music Understanding Group describes
using a U-Net architecture here: [https://medium.com/this-week-in-machine-learning-ai/separati...](https://medium.com/this-week-in-machine-learning-ai/separating-vocals-in-recorded-music-at-spotify-with-eric-humphrey-51c2f85d1451) - paper at
[http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd7940877...](http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd79408775cd0c37a4ff62.pdf)

They compare their performance to the widely-cited state-of-the-art Chimera
model (Luo 2017):
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24)
with examples at
[http://naplab.ee.columbia.edu/ivs.html](http://naplab.ee.columbia.edu/ivs.html)
- judging from the examples, there's significantly less distortion than in
OP's results.

Not to discourage OP from doing first-principles research at all! But it's
often useful to engage with the larger community and know what's succeeded and
failed in the past. This is a problem domain where progress could change the
entire creative landscape around derivative works ("mashups" and the like),
and interested researchers would do well to look towards collaboration rather
than reinventing each other's wheels.

EDIT: The SANE conference has talks by Humphrey and many others available
online:
[https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/vid...](https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/videos)

~~~
musicale
People have also been doing audio source separation effectively for years
without neural networks.

~~~
calf
It's interesting because I have a recording of human voices plus a background
TV show that was too loud; I've looked around for something that would be able
to separate the two, but I haven't found a straightforward solution.

For example, if you Google it, FASST is one of the tools that comes up, but
it's a whole framework, and to use it you'd have to learn the research
yourself; much of this software is not geared towards end users.

~~~
lightedman
Learn how to do waveform inversions - if you have a stereo signal, anything
not fully-centered will come through better while the rest is cut out. You can
then take that, invert it, and play it back with the original, cutting out
that noise and keeping the fully-centered things like vocals present.

This is how I play guitar to my favorite songs on my computer.

~~~
proctor
this sounds interesting. could you elaborate a bit? it is unclear if you are
inverting once, or twice. "if you have a stereo signal, anything not fully-
centered will come through better while the rest is cut out" -- is this
before or after an inversion?

~~~
thibauts
Of stereo tracks L and R, you invert R and add it to L, effectively canceling
anything centered. This usually removes voices. If you subtract (invert then
add) the result from the original L and R tracks you get centered sounds only.
Results range from perfect to not effective at all depending on the songs you
apply it to.
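
For anyone who wants to try it, here's a rough Python sketch of the
cancellation step (the soundfile library is assumed, and the filenames are
placeholders):

    import soundfile as sf

    audio, rate = sf.read("song.wav")        # stereo: shape (samples, 2)
    left, right = audio[:, 0], audio[:, 1]

    # Inverting R and adding it to L is just subtraction; anything identical
    # in both channels (the "center") cancels. Halved to avoid clipping.
    no_center = 0.5 * (left - right)
    sf.write("no_center.wav", no_center, rate)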

~~~
lightedman
This is correct. It's a shame there are no music players that can do this
natively. I have to do every track I wish to play along with by hand in
Audacity (or, if I'm using my Win2K laptop, Cool Edit Pro 2000).

Alternatively (in theory, anyway) you can do a spectral analysis, create a
selection/range of frequencies to invert, and that can usually handle
isolating individual instruments regardless of stereo placement, as long as
the other instruments' ranges do not overlap with the one you're isolating.
I'd love a music player that could do that.
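
A rough sketch of the spectral version (inverting and summing a selected band
amounts to zeroing it, which an STFT mask does directly), assuming scipy, a
mono signal x at sample rate "rate", and made-up band edges:

    from scipy.signal import stft, istft

    # Zero out one frequency band via an STFT mask; 200-1000 Hz is a
    # hypothetical range for the instrument being removed.
    f, t, Z = stft(x, fs=rate, nperseg=2048)
    Z[(f >= 200) & (f <= 1000), :] = 0
    _, x_filtered = istft(Z, fs=rate, nperseg=2048)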

------
emcq
What motivates people to invent phrases like "perceptual binarization" when
googling "audio binary mask" literally gives you citations from a field that
has been doing this for years?

For example, "Musical sound separation based on binary time-frequency
masking" (2009).

Or more recent work using deep learning. Also, the field generally prefers
ratio masks, because they lead to better-sounding output.
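
For reference, both masks are one-liners. A sketch assuming V and A are
same-shaped magnitude spectrograms (numpy arrays) of the isolated vocals and
the accompaniment:

    mix = V + A                           # mixture magnitude (approximation)
    binary_mask = (V > A).astype(float)   # ideal binary mask: keep a bin or drop it
    ratio_mask = V / (mix + 1e-8)         # ideal ratio mask: soft per-bin weight
    vocals_hard = binary_mask * mix
    vocals_soft = ratio_mask * mix        # typically sounds less distorted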

~~~
avian
I know from my own experience that it's possible to dig yourself quite deep
into some niche research field without realizing that there's an existing body
of knowledge about it. If neither you nor anyone else in your research circle
knows the right keywords to enter into search fields, it's really easy to
overlook piles of published papers.

I want to say things were different back when we relied more on human
librarians in searching for literature, but unfortunately history is full of
cases where people independently discovered the same things as well.

~~~
ska
My approach to avoiding this is always to try to find a recent, well-written
Master's or Ph.D. thesis in the area. You can't always find one, of course,
but if you do, they tend to have pretty good context and a more detailed
bibliography than you'll find elsewhere.

That said, if you are still at the point of inventing new terms for things
people have been doing for decades, you are probably engaging with the area
fairly superficially as well.

Research areas like CNNs are especially prone to this, because it is so much
easier to apply the techniques than to understand the problem domain, and that
generates a lot of low-quality research papers. See also "when all you have is
a hammer".

~~~
ericb
> My approach to avoiding this is always to try to find a recent, well-written
> Master's or Ph.D. thesis in the area.

How do you find this?

~~~
ska
Typically I start by looking at the home pages of labs/groups that are active
and strong in the area. Some universities have good search capabilities and
digital copies; others don’t. Lots of people will have draft versions on their
own pages, though.

If you find someone very active in the area and you like the look of their
papers, you can always try asking them...

------
GistNoesis
Hello, a little self-promotion: you can see our experiment with deep
neural networks doing real-time audio processing in the browser, using
tensorflow.js:

[http://gistnoesis.github.io/](http://gistnoesis.github.io/)

If you want to see how it's done, it's shared source:
[https://github.com/GistNoesis/Wisteria/](https://github.com/GistNoesis/Wisteria/)

Thanks

------
SyneRyder
Does anyone know if this is related to the new iZotope RX 7 vocal isolation &
stemming tools? It does seem to be talking about something similar, especially
when it mentions using the same technique to split a song into instrument
stems.

(Or to put it another way - there is commercial music software released in the
last year that lets you do this yourself now.)

[https://www.youtube.com/watch?v=kEauVQv2Quc](https://www.youtube.com/watch?v=kEauVQv2Quc)

[https://www.izotope.com/en/products/repair-and-edit/rx/music...](https://www.izotope.com/en/products/repair-and-edit/rx/music.html)

~~~
pizza
Going back further, X-Tracks did this ~5 years ago
[https://vimeo.com/107971872](https://vimeo.com/107971872)

That said, I think a deep learning approach will likely do a lot better (and
be a lot easier to develop, imo)

Also, check out Google's Magenta project; it aims to use ML in various music /
creativity projects.

I personally plan on doing a project involving audio source separation as well
as sample classification. A good trick for analyzing audio data is to convert
it into images (maybe with some additional pre-transformations applied, such
as passing it through an audio filter that exaggerates human-perceived
properties of sound) and then just use your run-of-the-mill, bog-standard,
state-of-the-art image classifiers on the resulting audio spectrogram, with
some well-chosen training/validation sets.
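
As a sketch of that trick (librosa is assumed, a mel spectrogram plays the
role of the perceptual transformation, and the filename is a placeholder):

    import numpy as np
    import librosa

    y, sr = librosa.load("clip.wav")
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    img = librosa.power_to_db(S, ref=np.max)   # 2-D array; treat as a grayscale image
    # "img" can now go into any run-of-the-mill image classifier.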

~~~
Eli_P
That's interesting. Are you going to make something like LabelImg[1]? I've
been looking for something like that for audio, yet I'm not sure about
treating audio as images. I've heard of this trick, but NNs for audio tend to
work better as RNNs (GRU[2], maybe LSTM), while images are processed with
CNNs.

[1]
[https://github.com/tzutalin/labelImg](https://github.com/tzutalin/labelImg)
[2]
[https://en.wikipedia.org/wiki/Gated_recurrent_unit](https://en.wikipedia.org/wiki/Gated_recurrent_unit)

~~~
pizza
I was gonna do something involving about 3 different neural nets:

a source separator: takes one audio stream as input and produces a set of
audio streams as output.

a segmentation regression neural net: takes an audio stream as input and
returns start and stop timestamps of individual samples as output, or
alternatively, just trimmed copies of the audio stream.

a sample classifier: takes an audio stream and returns “kick drum”, “snare
drum”, “voice”, “guitar”, etc.

then the pipeline would be like

source separator => segmenter => sample classifier

Hopefully with this I would be able to decompose music into constituent parts,
useful for remixing and other kinds of musique concrète.

I expect that the results with a deep pretrained generic image model, plus
some tweaking with more niche training examples, will be satisfactory; if not,
it would be a good excuse to experiment with more traditionally
sequence-oriented network architectures.
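
In (entirely hypothetical) code, the pipeline might look like this, with all
three models as stand-ins to be trained:

    from typing import List, Tuple
    import numpy as np

    Audio = np.ndarray  # one mono stream as a 1-D float array

    def separate(mix: Audio) -> List[Audio]: ...              # source separator
    def segment(stream: Audio) -> List[Tuple[int, int]]: ...  # start/stop samples
    def classify(clip: Audio) -> str: ...                     # "kick drum", "voice", ...

    def decompose(mix: Audio) -> List[Tuple[str, Audio]]:
        labeled = []
        for stream in separate(mix):             # source separator => segmenter
            for start, stop in segment(stream):  # segmenter => sample classifier
                clip = stream[start:stop]
                labeled.append((classify(clip), clip))
        return labeled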

------
bsaul
I used to work in an audio processing research center back in 2003, and
colleagues next to me were able to isolate each instrument in a stereo mix,
live, using the fact that the instruments were "placed" at different spots in
the stereo field.

Don't ask me how they did that; it was close to magic to me at the time, but
I'm sure it wasn't neural networks. It probably involved convolution, though,
as that is the main tool for producing audio filters.

If anyone has more info on the fundamental differences between the neural
network approach and the "traditional" one, I'd be thankful.

~~~
tomc1985
There's a trick you can do to isolate vocals from some music by essentially
flipping one of the stereo channels and combining their waveforms. All the
stereo data cancels out and you're left with anything not panned hard center.
Recombine that with the original stereo file converted to mono and you then
get the vocals, usually a bunch of cruft from the reverb, and anything else
panned hard center.

~~~
tobr
_Removing_ the center is possible, but how exactly would you "recombine" it
with the original to _keep_ the center? Correct me if I'm wrong, but I don't
think the math works out like that.

Say we have mixed three sources, let's call them L (panned hard left), C
(center) and R (hard right). Then the left channel has +L+C, and the right
channel has +R+C.

Now we phase invert one of them, say the right channel, and combine them. The
new mono file is +L+C-R-C. +C-C cancels out and we're left with +L-R.

Since +R and -R sound essentially the same, it sounds as if we had
originally done a mono mix of L and R (+L+R).

But we can't combine this with a straight mono conversion (+L+C+R+C) in any
way that will remove both L and R. All we can do is reproduce +L+C or +R+C.
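
To spell out why (a and b are arbitrary gains for a mono mix of the two
channels):

    a(L + C) + b(R + C) = aL + bR + (a + b)C

Zeroing out L and R forces a = 0 and b = 0, but then (a + b)C = 0 as well, so
no mix of the two channels keeps C by itself.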

~~~
tomc1985
A stereo audio file only has two channels, mixed down from any number of
sources: L and R. The process I described essentially converts the track
(we'll call it X) to mid/side (M/S) form, which is used by some microphones
and other outboard equipment. We're exploiting properties of both forms of the
audio to produce our isolated center.

Also, if you have the removed data, all you have to do is invert it and sum it
against the original material and it will cancel out.

I described it slightly incorrectly in the original post, but I have used this
process to extract vocals for remixing.

The first inversion produces one channel containing nothing but stereo data,
as you're summing L+Ri to produce S -- the center-channel data (technically,
anything identical in both stereo channels) cancels itself out. (Note that if
you combine L+Ri you should get a zeroed-out waveform.)

The second inversion, produced by combining the first inversion (Ri) and the
original source converted to mono (L+R), produces the isolated center channel
(M). It works because you are essentially only cancelling out stereo data you
generated in the first inversion.

If you don't believe me, load up a copy of Audacity or Sound Forge and try it.
(Note: the music needs to be _uncompressed_.) One track this works with is the
original mix of "Day 'n' Nite" by Kid Cudi, if you can find a WAV copy
somewhere. It doesn't work with a lot of music.

So...

    
    
    L, R = stereo_split(X)
    Ri = inverse(R)
    S = L + Ri
    Si = inverse(S)
    M = Si + (L + R)

    # Note: L + Ri should result in an empty waveform

~~~
tobr
Trying to follow this, but I’m lost at a few points.

If you say you have a file where +L-R = 0, then L = R? That’s just a mono
file?

L/R stereo and M/S stereo conversion works like this. M = L+R, S = L-R (or S =
R-L, either one is fine).
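
As a quick sketch (treating channels as sample arrays), the round trip loses
nothing:

    M = L + R          # mid
    S = L - R          # side
    L2 = (M + S) / 2   # recovers L exactly
    R2 = (M - S) / 2   # recovers R exactly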

If you expand out your expression M=Si+L+R, where Si=-(L-R)=-L+R, you have
M=-L+R+L+R, which means you are saying M=+R+R. This doesn’t seem right.

If you just want an actual M, that's just a mono conversion of L+R. For sure,
if the recording is M/S, the two sides cancel out and you will extract the mid
channel, which would often contain the vocals - but there's no reason to do
all the repeated inversion and mixing to get it!

~~~
tomc1985
+L-R = 0

+L+R gets you a mono file (before volume reduction). +L-R cancels out. Try it
out in a wave editor with a sine wave.

The figures I wrote in my previous post use + as a summing operation, not to
indicate polarity.

I discovered the process on my own, so I'm sure there are better ways to do
it; I'm not a recording engineer :)

------
tasty_freeze
Trivia: Avery Wang, the guy who invented the Shazam algorithm and was their
CTO, did his PhD thesis on this topic:

[https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CC...](https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CCRMA.html)

"Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory
Source Separation" (1994)

------
sytelus
Please stop publishing on Medium. I'm getting the error "You read a lot. We
like that. You’ve reached the end of your free member preview for this month.
Become a member now for $5/month to read this story".

Not gonna do that.

~~~
computerex
Share this sparingly lest it get "fixed" ;)
[https://outline.com/SauXTY](https://outline.com/SauXTY)

------
En_gr_Student
There is a lagged autoregressive technique used in forensic analysis that
allows 3D reconstruction from 1D (single-mic) sound.

A CNN should be able to back that out too, and do other things like regenerate
a 3D space. Given the right high-fidelity acoustic tracks, there could be
enough spatial information to reconstruct a stage and a performance. It would
be neat/beautiful/(possibly very powerful) to back video out of audio in that
way.

------
plaidfuji
The presentation of this project alone is a visual tour de force, to say
nothing of the technical quality. Beautiful and easily digestible post. As
with any interesting, non-toy applied ML problem, the dataset generation is
really where the innovation is; it gets a neat little graphic at the end. As
far as how the author characterizes the problem, I think the term he's looking
for is "semantic segmentation" - he's trying to classify each pixel of the
spectrogram as vocal/non-vocal. I'd be curious whether he could drop the
dataset into pix2pix-style networks and achieve the same results.

------
8bitsrule
Question: has any progress been made in removing reverb?

There are many, many historical recordings (and modern ones made in less-than-
ideal circumstances) that suffer badly from reverb. Seems like a valuable use-
case that -ought- to be in reach today.

~~~
marzell
I don't have an answer for this. However, since they do have effective blur
reduction/elimination techniques for visual images, I imagine that with enough
resources we are not far from reverb/echo reduction in audio.

~~~
Eli_P
Blur elimination is usually done with an unsharp mask, which works by blurring
the raster even more and comparing it to the original. The output makes the
edges sharper, but some information is lost anyway.

Reverb elimination can be done without losses, just with distortions depending
on the implementation. To do that, one has to recover cepstral[1] coefficients
(with a NN) and feed them to spectral filters (no NN needed).

This is feasible, provided somebody prepares a training data set consisting of
lots of pairs (sound, same_sound_with_reverb), where the sound would be a
voice, instrument, applause, etc., each with different reverb settings. Very
likely you'll have to use enormous sample rates, way beyond 44100, because
you're supposed to capture the fine structure of the impulse response... which
adds to the hardware requirements.

I feel like I've oversimplified something, but it can be done, just with lots
of fiddling with the training sets and the training process itself.

[1]
[https://en.wikipedia.org/wiki/Cepstrum](https://en.wikipedia.org/wiki/Cepstrum)
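
A sketch of how one such pair could be synthesized, by convolving a dry
recording with a room impulse response (scipy and soundfile are assumed; mono
files; filenames are placeholders):

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    dry, rate = sf.read("dry_voice.wav")
    ir, _ = sf.read("room_impulse_response.wav")

    wet = fftconvolve(dry, ir)[: len(dry)]   # the same sound, with reverb
    wet /= np.max(np.abs(wet)) + 1e-9        # normalize to avoid clipping
    sf.write("wet_voice.wav", wet, rate)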

~~~
8bitsrule
Thanks for that, and for the link. I recall playing long ago with software
that used convolution to let a user choose from different 'reverb spaces'
(e.g. the Taj Mahal) to put audio into.

It occurred to me that in many/most cases the audio _from the source_ arrives
first. So -in some cases- multiple models of the 'verb space' could be
constructed/refined to allow filtering. Probably much easier for a lone
speaker in a small, geometrically simple room -without- a P.A. Maybe not so
easy for a speaker using a P.A. in a cathedral. But the Power of Fourier is
mighty.

------
switchbak
Just wanted to mention there are some folks doing realtime source separation
(not sure exactly how they've implemented it) with a DNN, for reduction of
background noise in, e.g., Skype conversations.

I'm not involved with them in any way, but I've been amazed by its ability
to cancel out coffee-shop-style noise.

Check out [https://krisp.ai/technology/](https://krisp.ai/technology/) -
Mac/Windows. I wish they had Linux support!

Edit: Appears they don't have Windows support yet.

~~~
brucemoose
> Uninterrupted Voice The same krispNet DNN, trained on hundreds of hours of
> customized data, is able to perform Packet Loss Concealment (predicting lost
> network packets) for audio and fill out missing voice chunks by eliminating
> "chopping" in voice calls.

This is both fascinating and horrifying at the same time! I wonder if/when it
would be possible to rewrite whole words in real time using a voice that
sounds just like you.

------
syntaxing
Clicked into the article because I was curious how the training set was
created. Using the acapella version is an amazing idea! I wish the article
went more in-depth on this part.

------
Animats
Is it possible yet to take a recording of singing and generate a model of the
singer for synthesis, like a Vocaloid?

------
petra
Question: Currently, building earphones with great active noise cancellation
is a secret kept within a few companies.

This means they're expensive ($300 headphones from Bose, etc.).

Do neural networks make this simpler?

And do you think they can be applied cheaply enough, say for $99 headphones?

I assume this would sell really well, and justify creating a dedicated chip
in time.

~~~
sonnyblarney
Running any kind of neural net in realtime is usually not possible due to
processing power requirements.

------
sonnyblarney
Soon enough there will be an AI filter that will take any old hacky, coughing,
wheezing singer running around on stage, singing out of tune - and turn it
into virtuoso chops. Maybe even derived from their own voice.

Which will give entirely new meaning to 'lip synching'.

~~~
vonseel
I can’t wait.

Seriously, of all instruments, vocals are one of the most heartbreaking to
work on and learn. Born male and want to sound like a female singer? You’ll
never be able to do that. The same applies to women who wish they had male
singing voices. It simply isn’t possible (with the rare exception, I guess).

Or maybe you just don’t like your voice’s timbre. You can take lessons for
years and learn to sing on pitch, and you can alter your vocal tone, but you
can’t control every aspect that gives your voice its unique sound.

I guess sometimes you just have to be happy with what you have.

------
samstave
A fun thing to do with this would be to slurp the lyrics from one song, the
beats from another, some other stream from a third, and remix the “threads”
together into something new.

Basically a giant equalizer that lets you dim or brighten each channel from
multiple sources.

------
smrtinsert
I've found this project to be very useful if you want access to something
like what the article describes:
[http://isse.sourceforge.net](http://isse.sourceforge.net)

------
canada_dry
I'd like to try using this kind of thing to build an automated Beat Saber map.
The ability to orchestrate the beats very specifically would make for
excellent mappings.

Alas, so many projects, too little time!

------
dharma1
Sounds pretty good, but it exhibits the same artifacts/phasing that I've heard
with other source separation. Good for forensics etc., but I wouldn't use this
for music production.

------
jtbayly
There was a similar demo (I think from Google) here on HN sometime last year
that was far more impressive. I can't seem to find it though. Anybody know
what it was?

------
exabrial
Are there any hearing aid manufacturers taking this approach? Quite
incredible.

