
End-to-end Music Source Separation - homarp
http://jordipons.me/apps/end-to-end-music-source-separation/
======
homarp
code is here [https://github.com/jordipons/source-separation-
wavenet](https://github.com/jordipons/source-separation-wavenet)

"Currently the project requires Keras 2.1 and Theano 1.0.1, the large
dilations present in the architecture are not supported by the current version
of Tensorflow"

------
jlarcombe
Interesting this should come up now. I just bought Izotope RX7 which has a
built in 'Music Rebalancer' and, according to their PR, this uses machine
learning techniques. I've been playing around with it this week in spare time
and have been quite impressed with how well it works on quite disparate
sources from my music collection. It's extremely useful when you want to
transcribe details from instrumental tracks that are hard to discern past the
vocals in the overall mix. It seems to work less well on material that's been
crushed in the mix/mastering, and material that has a lot of upper-mid content
competing with the vocals. On clean, open material, it works remarkably well.

------
myself248
"Music source separation" appears to mean "vocal removal" but also setting
aside the vocals into their own track.

That's all I can glean from context, anyway.

~~~
amelius
If only the recording studios would provide unmixed tracks ...

It might even be a significant revenue source, as there are plenty of
audiophiles who would pay to be able to change the volume of the string
section relative to the flute section, et cetera. Or DJs making remixes.

~~~
beat
It's not that easy, unfortunately. Mixing is a lot more complex and involved
than it looks. It's a balancing act, making tracks stand out from the
background or blend into the background, and finally putting it all into a
coherent-sounding whole.

For example, many mix engineers - most, probably - use a buss compressor as
"glue" to help blend the whole mix together. Take out something loud like lead
vocals or drums, and the buss compressor behavior changes, changing the rest
of the mix. A lot of mixers also use sidechain compression to duck parts
depending on other parts (like the kick drum taking a few dB out of the bass).
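
For the curious, a rough numpy sketch of what that kick-ducks-the-bass sidechain behavior boils down to (toy envelope follower and made-up parameters, not any real compressor):

```python
# Toy sidechain ducking: the kick's envelope drives gain reduction on the bass.
import numpy as np

def duck(bass, kick, sr, threshold_db=-30.0, max_cut_db=4.0,
         attack_s=0.005, release_s=0.1):
    # Crude envelope follower on the sidechain (kick) signal.
    env = np.abs(kick)
    followed = np.zeros_like(env)
    a_att = np.exp(-1.0 / (attack_s * sr))
    a_rel = np.exp(-1.0 / (release_s * sr))
    prev = 0.0
    for i, e in enumerate(env):
        a = a_att if e > prev else a_rel
        prev = a * prev + (1.0 - a) * e
        followed[i] = prev
    env_db = 20.0 * np.log10(followed + 1e-9)
    # Above the threshold, pull the bass down by up to max_cut_db.
    over_db = np.clip(env_db - threshold_db, 0.0, None)
    cut_db = np.minimum(over_db, max_cut_db)
    gain = 10.0 ** (-cut_db / 20.0)
    return bass * gain
```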

And beyond the mix, it needs to be mastered (which costs money) and packaged
separately from the rest of the album, making it a different product to be
sold separately.

On the other hand, some fields actually do instrumental tracks with vocals
stripped out as regular products - karaoke and hip-hop tracks are often made
from the original mixes.

Personally, as a mix engineer, I don't _want_ some home genius second-guessing
my mixing decisions.

~~~
amelius
> Personally, as a mix engineer, I don't want some home genius second-guessing
> my mixing decisions.

But the home user could start from your default settings?

~~~
beat
Not really. Not unless they have my DAW (mixing software), and the plugins I
use on it. There's a lot more to mixing than just relative levels. It's kind
of a black art that takes years to learn. I've been at it seriously for a
decade now, and I feel I'm just beginning to get good (although I've gotten
lucky in the past, it wasn't because I really understood what I was doing).

Here's an example. I listen critically to a lot of mixes, in order to learn
from them, so I hear flaws everywhere. One badly flawed classic song, to my
ear, is Al Stewart's "Year of the Cat". It's _extremely_ sibilant on the
vocals. Listen to it on bright-sounding headphones, and every S sound in the
vocal is shrill and hissy-sounding. Adjusting the level won't fix that. I want
to run a de-esser on the vocal. And it's not a knock on the mix in general,
which is mostly gorgeous, lush, classic 1970s Abbey Road sound. 98% of it is
magic beyond my meager skills. But oh god the sibilance.

edit: I should add here that individual tracks, solo'd outside the context of
a mix, can sound really weird and wrong - they get manipulated to blend into
the whole, not to sound great on their own. For example, I high-pass most
guitars at 250-300 Hz, even though the low E string is way down at 82 Hz. The
reason is to get them out of the bass space, freeing it for the bass
guitar/synths and kick drum. So if you solo the guitar, it can sound thin and
wrong. But in the mix you'll never notice the difference - you'll just hear
clearer bass.
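
For illustration, the high-pass move above is roughly this (a scipy sketch; the cutoff and filter order are just examples, not a recipe):

```python
# Roll off a guitar track below ~250 Hz so it stays out of the bass/kick range.
from scipy.signal import butter, sosfilt

def highpass_guitar(audio, sr, cutoff_hz=250.0, order=4):
    sos = butter(order, cutoff_hz / (sr / 2.0), btype='highpass', output='sos')
    return sosfilt(sos, audio)
```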

------
antpls
So, I'm not an expert in music nor signal processing, but : on some of the
vocal separation examples, we can hear the rhythm instruments in the
background. Would it not be easy to detect those instruments and remove them
from the data? It's not like those are random noises, they are predictable
signals. Is it a limitation of using neural networks, or is the problem harder
than it looks?

~~~
EADGBE
I believe "remove vocal" algorithms already implemented rely mostly on the
pitch and tonal spectrum of the voice.

For your example, what would be done when vocals sync up with rhythm
instruments (straight 1/8 or 1/4 notes played on guitar and sung as well)?
It happens all the time, often only a bar at a time, but sometimes longer. e.g.
The Strokes - Last Nite
([https://youtu.be/TOypSnKFHrE?t=62](https://youtu.be/TOypSnKFHrE?t=62)) @1:03
"_they don't under_-stand" (repeated) could trick the detection into
thinking it's some sort of rhythm instrument.

My point is, all music is extremely rhythmic and requires everything else in
the band to be rhythmic as well.

------
synthmeat
Why don't they just use audio stems as training data?

~~~
francesclluis
I'm Francesc Lluis, one of the coauthors of the paper.

The reason we don't use audio stems as training data is that, during
preparation of the MUSDB dataset, conversion to WAV can sometimes halt because
an ffmpeg process, used within the musdb python package to identify the
dataset's mp4 audio streams, freezes. This seems to be an error occurring in
the subprocess.Popen() call used deep within the stempeg library. Due to its
random nature, it is not currently known how to fix this.
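
For reference, loading the stems through the musdb python package looks roughly like this (a sketch assuming musdb's documented API, which varies by version; decoding the .stem.mp4 files is the step that goes through stempeg/ffmpeg, where the freeze described above happens):

```python
# Sketch of reading MUSDB stems via the musdb package (API details may vary by version).
import musdb

mus = musdb.DB(root='path/to/MUSDB18', subsets='train')
for track in mus.tracks:
    mixture = track.audio                   # stereo mixture waveform
    vocals = track.targets['vocals'].audio  # isolated vocal stem
    sr = track.rate
```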

------
rorykoehler
This is awesome. I've been thinking about attempting something like this in
order to improve the accuracy of samples I want for tracks. Looking forward to
seeing how it works out.

------
cosmic_ape
This mentions ICA and NMF, but in contrast to those the proposed method is
supervised learning, not unsupervised.

I'd suggest the authors try something like an autoencoder in the waveform
domain. That would be a closer analog to the ICA methods.
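
Something along these lines, perhaps - a toy Keras sketch of a 1-D convolutional autoencoder over raw waveform chunks (layer sizes are made up, just to show the shape of the idea):

```python
# Toy waveform-domain convolutional autoencoder (illustrative sizes only).
from keras.models import Model
from keras.layers import Input, Conv1D, UpSampling1D

x_in = Input(shape=(16384, 1))  # raw waveform chunk
# Encoder: strided convolutions compress the waveform into a latent code.
h = Conv1D(16, 15, strides=4, padding='same', activation='relu')(x_in)
h = Conv1D(32, 15, strides=4, padding='same', activation='relu')(h)
# Decoder: upsample back to the original length.
h = UpSampling1D(4)(h)
h = Conv1D(16, 15, padding='same', activation='relu')(h)
h = UpSampling1D(4)(h)
x_out = Conv1D(1, 15, padding='same', activation='tanh')(h)

autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')
```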

~~~
cannam
I think it only mentions ICA/NMF to say that they generally aren't applied to
time-domain signals, which are not non-negative and have phase as a
confounding factor.

Here's another intriguing (very different) recent paper on time-domain source
separation:
[https://arxiv.org/abs/1810.12679](https://arxiv.org/abs/1810.12679)

~~~
jordipons_mtg
I'm Jordi Pons, one of the coauthors of the paper.

You both are right! We basically mention ICA/sparse coding as prior work on
waveform front-ends for source separation.

Our method is supervised, and we did not explore the unsupervised learning
approach. However, some people are doing that! Check out S. Venkataramani and
P. Smaragdis's work!
[https://scholar.google.es/citations?user=hCSSNZwAAAAJ&hl=es&...](https://scholar.google.es/citations?user=hCSSNZwAAAAJ&hl=es&oi=sra)

Although we did our best by comparing against DeepConvSep and Wave-U-Net, I
agree that it would be useful to properly benchmark all of that!

------
jimbo1qaz
You could use retro game music (specifically SNES) as training data, since
it's very easy to use programs to render individual channels to WAV.

Note: I'm the author of towave-j, a tool for game music splitting.
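
A rough sketch of how the rendered per-channel WAVs could be turned into (mixture, isolated-source) training pairs - the file layout, naming, and channel grouping here are hypothetical:

```python
# Hypothetical: sum per-channel game-music renders into a mixture, and a chosen
# subset of channels into the isolated "source" a separator should learn to extract.
# Assumes equal-length renders and the soundfile package for reading WAVs.
import glob
import numpy as np
import soundfile as sf

def make_pair(song_dir, target_keywords):
    channels = {}
    for path in sorted(glob.glob(song_dir + '/channel_*.wav')):
        audio, sr = sf.read(path)
        channels[path] = audio
    mixture = np.sum(list(channels.values()), axis=0)
    source = np.sum([a for p, a in channels.items()
                     if any(k in p for k in target_keywords)], axis=0)
    return mixture, source
```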

