First, the original:
Now the derived stems:
Note: I'm not affiliated with this project or Mr. Coulton. I just think this is a cool project and wanted to share.
It is amazing how the ear manages to distinguish all the sounds without distortion.
That's like complaining about how badly the pig plays the violin. This is absolutely incredible. The complexity level for this problem is right off the scale, and the software does a passable job of it. Given some time, more training data, and a few more people working on it, this has serious potential.
Also, I am not affiliated with anything in particular that has been mentioned, or with the pig.
I have no idea how this tool splits them up at the implementation level, but I imagine it tries to split things up based on frequencies, and when it lifts out the voice it's cutting out a ton of frequencies that would normally be in your voice, so it sounds very unnatural, blocky and metallic.
With studio-quality headphones I notice a massive difference for the worse between the original and the separated vocal track. It reminds me of when I set a noise gate's threshold too high when recording audio for my courses. A noise gate clamps down on audio that falls below a level threshold to help eliminate room noise, but set too aggressively it also dampens or removes the quieter, low-end content of an average male voice. It gives that same very jagged-sounding audio waveform.
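As a rough illustration of why an over-aggressive gate sounds so jagged, here is a naive hard gate in Python (a sketch of the general idea, not any particular plugin's behaviour):

    import numpy as np

    def hard_gate(audio: np.ndarray, threshold: float = 0.05) -> np.ndarray:
        # Anything below the threshold is silenced outright; set too high,
        # this chops off the quiet tails of words and produces the jagged
        # waveform described above.
        gated = audio.copy()
        gated[np.abs(gated) < threshold] = 0.0
        return gated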
For example, in the beginning, if you listen to the phrase "from the office ...", in the original recording it sounds smooth, like a single phrase (and the voice is warm and pleasant), but in the separated track it sounds like it was synthesized from pieces that are not properly connected, like vocaloid songs. It sounds a little harsher, and the transitions sound unnatural. And the phrase "heya Tom" in the separated track is split into "heya" and "Tom" with some unnatural sound between them (or maybe at the beginning of "Tom"). It is like the transitions you can hear in vocaloid tracks, or the kind of artifacts you get if you over-compress an MP3 file.
And "it's good to see you" part also sounds robotic.
Maybe it's losing some of the harmonics, but in a different way for different syllables, so they don't sound like a single phrase anymore.
The vocals don’t have any significant residual artefacts from drum hits or any residual bleed-through of instruments playing the same notes. It’s magical.
Edit: the source is an mp3, which removes audio frequencies based on perceptual masking by other frequencies. It's perfectly normal that it shows artifacts; a better source is needed.
Edit: I uploaded the original FLAC to my web server last night, before I decided that MP3 would be more convenient:
I basically want to run this over every Steve Gadd recording I have.
I think the next step would be to train a network that can un-robotify songs and then run it on this.
funny, the original recording seemed kind of robotic to me! maybe not robotic, but like it's been filtered somehow. but that might just be my not-so-great headphones
Bravo. For people who didn't get the sublime reference, Suzanne Vega's song Tom's Diner was a benchmark test during development of the MP3.
So that's why the MP3 format mangles male vocals so badly at all but the highest bitrates. Now you know the Rest Of The Story... or at least you've read it on the Internet.
But you probably don't need to bother with old recordings, since there is SO MUCH music being produced via tracking software right now that it should be possible to get a pretty big dataset - the difference being, of course, the professional production that affects how all these things sound in the final mix.
Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
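A sketch of that shuffling idea, assuming the stems are already available as equal-length float arrays (the names and gain range here are made up):

    import numpy as np

    rng = np.random.default_rng()

    def remix(stems: dict) -> tuple:
        # stems: name -> float32 array, all the same length.
        # Random gains per stem; add random offsets, EQ, pitch shifts
        # etc. for even more variety.
        scaled = {name: rng.uniform(0.3, 1.0) * s for name, s in stems.items()}
        mix = sum(scaled.values())
        return mix, scaled  # training pair: (input mixture, target stems)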
I find that pretty amazing given the litigiousness of some in the music industry, but there we are.
Side note: I discovered Rick Beato a few months ago and I've watched heaps of his videos. It's really fascinating hearing old classics torn down to their constituent parts. Here's one of my favourites of his: https://www.youtube.com/watch?v=ynFNt4tgBJ0 (Boston - More than a feeling).
Don't forget JazzDuets's channel. His content seems the most mature and uses a lot of actual playing to tune your ear. I find him a bit too advanced for my level, but I really like his very humble and friendly personal touch.
Given that, I expect that a show titled "What Makes This Song Great" will do fine. Who doesn't love having somebody note the non-obviously good parts of their work? Especially if, as with Weird Al, proper royalties are paid.
Once they sign, the RIAA and label lawyers get to work, so the creator may not have any influence or own the masters.
Artists have a good chance of getting the point that authentic publicity is gonna garner authentic fans with authentic ticket stubs, but then the contract, on page 147, section 14a, under "Rights and Royalties", states ...
Sources are e.g. multitracks that someone leaked (like original unmixed Madonna sessions), constructed MOGG files from various Rock Band games, stems prepared for remixers etc.
There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes where separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.
Ideally, I'd have liked to see a completely open audio codec used for both encoding and container, but MP4 is a pretty safe bet for compatibility, and it's not really NI's fault that it has some patent issues.
All in all, I could pedantically argue the "open" status, but I'll instead give credit where it's due, and give kudos to NI for releasing a pretty damn usable file format.
I'm even happy that it's limited to 4 parts. For the purposes of live performance with DJ style gear, this is plenty. If a performer wants more parts then they're probably going to be creating some or all of those parts. Either way, they'll probably be using something more like Ableton rather than Traktor.
I'm happier that it's a (mostly) open standard, but I'm still slightly annoyed at the confusion that comes from NI appropriating the industry term. It's like if I released a non-text format for storing data, built on a particular standardized subset of technology, and then called it "The Binary" format. Technically there's nothing wrong with it, but it's bound to cause confusion!
- Programmatically gather a list of all samples used in the song
- Generate many modified .spc files, each of which mutes 1 sample via editing the BRR data.
- Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.
Record the original song to .wav. Then, for each sample, record "the song with that one sample muted" and take (original song - song with sample muted) to isolate that one sample. If the result is not silent, you have isolated one instrument from the original song.
The results may not always be perfect, and will need manual labeling of instruments, or manually merging together multiple piano instruments. But I think this process will work.
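A sketch of the subtraction step in Python, assuming the emulator renders deterministically and each muted render has already been captured to a wav file (file names and the slot-count loop are placeholders):

    import numpy as np
    import soundfile as sf

    full, sr = sf.read("full_render.wav")

    for sample_id in range(256):  # SPC700 sample directory slots
        muted, _ = sf.read(f"muted_{sample_id:03d}.wav")
        isolated = full - muted              # what that one sample contributed
        if np.max(np.abs(isolated)) > 1e-4:  # not silent -> sample is used
            sf.write(f"isolated_{sample_id:03d}.wav", isolated, sr)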
However, this reminds me that filters probably make things much harder for the separation model, with the explosion of possible sounds from an instrument or voice. (Vishudha Kali's music is a nice illustration of that.)
For an idea of how this project is coming along:
Yes, it's terrible :) This particular file is the result of the following transformations:
midi file -> wav file (fluidsynth)
wav file -> midi file (my utility)
midi file -> wav file (fluidsynth once more)
wav file -> mp3 file (using lame)
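For reference, a minimal version of that round-trip driven from Python (the SoundFont path is a placeholder, and the wav -> midi step is the author's own utility, stubbed out here):

    import subprocess

    SF2 = "soundfont.sf2"  # any General MIDI SoundFont

    # midi -> wav: -n (no MIDI driver), -i (no shell), -F (render to file)
    subprocess.run(["fluidsynth", "-ni", SF2, "song.mid",
                    "-F", "step1.wav", "-r", "44100"], check=True)

    # step1.wav -> step2.mid is where the wav-to-midi utility would run.

    subprocess.run(["fluidsynth", "-ni", SF2, "step2.mid",
                    "-F", "step3.wav", "-r", "44100"], check=True)
    subprocess.run(["lame", "step3.wav", "result.mp3"], check=True)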
Of course it also works for regular midi files (piano only for now). The reason I use the workflow above is that it gives me a good idea of how well the program works, by comparing the original midi file with the output one.
But I didn't yet have a way to deal with piano/voice, which is a very common combination, so this might really help me.
Possible applications: automatic music transcription, tutoring, giving regular pianos a midi 'out' port, using a regular piano as an arranger keyboard, instrument transformation and many others.
Edit: I've done a little write-up: https://jacquesmattheij.com/mp3-to-midi/
They say it’s only really good for piano, but I definitely use it for all kinds of samples. Great for inspiration
The Magenta code:
f_measure 71.56 precision 65.75 recall 78.49 accuracy 55.72
My little batch of code:
f_measure 77.74 precision 93.40 recall 66.57 accuracy 63.58
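For reference, the f_measure is just the harmonic mean of precision and recall, and the numbers above check out:

    def f_measure(precision: float, recall: float) -> float:
        return 2 * precision * recall / (precision + recall)

    print(f_measure(65.75, 78.49))  # ~71.56 (Magenta)
    print(f_measure(93.40, 66.57))  # ~77.74 (mine)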
Interpreting the results is tricky: Magenta is obviously better on recall, but at the expense of being much less precise, which gives my code the better f-measure overall; besides, its output is nicer to listen to because there are far fewer spurious notes.
My code also runs about 100 times as fast and uses very few resources. So, rather than being depressed, it looks like I'm on to something :)
There are four options.
> Convert Melody to New MIDI Track
> This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.
Neat. So the big difference, then, is that I do full polyphony but I'm still limited to 'just' piano, which is already hard enough for now.
I was fascinated by the work done by “The Sound of Pixels” project at MIT.
https://files.catbox.moe/uuzot3.mp3 (spleeter 2stem)
you can hear spleeter does better at actually taking out the bass drums, but phonicmind never loses or distorts any part of the vocal, while 2stems occasionally sounds like the singing is coming through a metal tube (harmonics are missing). will try to read the instructions more carefully and see if there's some way to fix it.
It would be really cool to create "music mappers"/life sounds tracks like what you can do with pictures & art styles (e.g. https://medium.com/tensorflow/neural-style-transfer-creating...)
One-click process: Xtrax Stems 2 (https://audionamix.com/technology/xtrax-stems/)
Professional: ADX Trax Pro 3
Both products use a server-side service with much larger pre-trained models. The professional one adds features such as handling sibilance, a GUI for editing note-following as a guide for the models, and an editor tool for extraction using harmonics.
(Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)
Either way, an open source competitor to Melodyne is a welcome addition!
methodology is a separate u-net per instrument type to predict a soft mask in spectrogram space (time x frequency), then they apply that mask to the input audio. fairly standard.
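In code, the idea looks roughly like this (a sketch; `vocal_unet` stands in for a trained per-instrument network and is hypothetical):

    import numpy as np
    import librosa

    audio, sr = librosa.load("mix.wav", sr=44100, mono=True)
    spec = librosa.stft(audio, n_fft=4096, hop_length=1024)

    # vocal_unet: hypothetical model mapping |spectrogram| -> mask in [0, 1].
    # The network sees only magnitudes; the mixture's phase is reused as-is.
    mask = vocal_unet.predict(np.abs(spec))
    vocals = librosa.istft(mask * spec, hop_length=1024)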
The headphones could filter out speech that isn't above a certain threshold, so coworkers nearby can still be heard loud and clear.
Music can play at volume then quiet itself when it detects a person speaking directly to you.
Maybe even a training button to tell it when it's false-positiving on background noise, or when it's wrongly silencing a coworker you would like to hear.
Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...
What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.
Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).
Finding drum breaks in music is very time-consuming. This is gonna be amazing for music production. Think how 90s jungle would've sounded if they'd had access to every drum take ever.
Edit: I see someone added it in as an answer 14 hours ago. Well, you have my vote ^^
Any parameter I could use with spleeter to get a similar output?
If you want to play along to drum + bass, then yes.
Far from useful.
Curious, what is Deezer using this for?
It would be really cool to use this to feed into Magenta. Think of the mashups!
$ spleeter separate -i spleeter/source.mp3 \
    -p spleeter:2stems \
    -o output
Also, other comments here speak of "separating into the voice and accompaniment," so maybe the model/program already does exactly what you need.
When you have an instrumental version of a song (from the same stems as the vocal version) this is already one way to get the vocals out without any fancy machine learning. The main tricks besides what you can do in Audacity like that are properly time-aligning the tracks (even if they drift a bit) and compensating for phase issues and compression. I wrote a dirty tool that does that and I've been meaning to turn it into some kind of nicer GUI version.
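A rough sketch of the easy part of that in Python, handling only a constant offset (real tracks drift, and mastering differences mean it rarely cancels this cleanly; file names are placeholders):

    import numpy as np
    import soundfile as sf
    from scipy.signal import correlate

    full, sr = sf.read("vocal_version.wav")
    inst, _ = sf.read("instrumental_version.wav")
    full, inst = full.mean(axis=1), inst.mean(axis=1)  # assume stereo; go mono

    # Estimate a constant offset by cross-correlating ten-second excerpts.
    a, b = full[: sr * 10], inst[: sr * 10]
    lag = int(np.argmax(correlate(a, b, mode="full", method="fft"))) - (len(b) - 1)
    if lag > 0:
        inst = np.concatenate([np.zeros(lag), inst])
    elif lag < 0:
        full = np.concatenate([np.zeros(-lag), full])

    n = min(len(full), len(inst))
    sf.write("vocals_rough.wav", full[:n] - inst[:n], sr)  # what doesn't cancel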
Or you can do the exact opposite: instead of center-channel extracting the vocals, you can remove the vocals (center-channel removal, as in the sketch below) and then use this method to better isolate the vocals.
Although if the mix does fun stuff with the stereo field, it might not cancel exactly.
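For completeness, the classic center-channel removal ("karaoke") trick is just this; it only works when the vocal is panned dead center and the instruments aren't:

    import soundfile as sf

    stereo, sr = sf.read("song.wav")    # shape: (samples, 2)
    side = stereo[:, 0] - stereo[:, 1]  # center-panned content cancels out
    sf.write("karaoke_mono.wav", side, sr)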
 - https://en.wikipedia.org/wiki/FLAC#Adoption_and_implementati...
[This was an unreasonable test, and it did really well considering the likely training set. I bet it could do much better with better data. Still... man, I was so hoping for magic.]
This has implications for extracting sounds from noisy recordings, or am I off base? Does it only track pitch patterns?
Can you send me your audio file? I'll send you back the midi I can generate; it won't be perfect but it might be usable. See comment elsewhere in this thread.
Here's the first (but by no means only) file I have in mind:
Stone Eyes from the Final Fantasy VII tribute album Voices of the Lifestream.