Spleeter: Extract voice, piano, drums, etc. from any music track (github.com/deezer)
1460 points by dsr12 on Nov 3, 2019 | 175 comments

For your listening pleasure, here's a full-length demo. I decided to use the Jonathan Coulton classic "Re Your Brains", because I can legally share and modify his music under its Creative Commons license.

First, the original:


Now the derived stems:

Vocals: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

Accompaniment: https://mwcampbell.us/tmp/spleeter-demo/jonathan-coulton-re-...

Note: I'm not affiliated with this project or Mr. Coulton. I just think this is a cool project and wanted to share.

Wow, I've listened to several attempts at this over the years, but this one is waaay better than anything I've heard. It's almost perfect.

IME, this tool is certainly an order of magnitude closer to the Holy Grail than anything I've ever heard. Kudos to Deezer R&D.


Holy cow! The separation is sort of perfect! Thanks for the demo.

"Indistinguishable from magic."

Did it just work or did you have to supply something beyond the original JC track?

Just worked.

They need to link these on the project readme as a demo.

While it's a great technology, the result sounds somewhat robotic. On the original recording the voice sounds soft, but after separation it sounds synthesized or passed through a vocoder, as if something is missing. The voice track contains pieces of the strumming sound. The guitar also sounds "blurred", as if someone cut an object out of a picture and blurred the edges of the cut to make it less visible. The clap sound is distorted: on the original recording it sounds the same each time, but after separation it sounds different every time, as if it was filtered or compressed at a low bitrate.

It is amazing how the ear manages to distinguish all the sounds without distortion.

> While it's a great technology, the result sounds somewhat robotic.

That's like complaining about how bad the pig plays the violin. This is absolutely incredible. The complexity level for this problem is right off the scale and the software does a passable job of it. Given some time and more training data and a few more people working on it this has serious potential.

I think you meant how well the pig paints. [1] [2]

[1] https://pigcasso.org/wp-content/uploads/2018/12/7.jpg [2] Also, I am not affiliated with anything in particular that has been mentioned, or with the pig.

I think the original was in relation to pigs dancing.

That roboticness is because there's overlap in frequencies between the voice and the instruments.

I have no idea how this tool splits them up at the implementation level but I imagine it tries to split it up based on frequencies and when it lifts out the voice, it's cutting out a ton of frequencies that would normally be in your voice so now it sounds very unnatural, blocky and metallic.

With studio quality headphones I can notice a massive difference for the worse between the original and separated vocal track. It reminds me of when I turned on a noise gate too high when recording audio for my courses. That noise gate clamps down on certain frequencies to help eliminate room noise but it also dampens or removes natural frequencies that occur on the low end of an average male voice. It gives that same very jagged sounding audio waveform.

I don't argue that it's a great technology, but some of the neighbour commenters wrote things like "perfect", and to me it doesn't sound like "perfect" yet.

For example, in the beginning, if you listen to the phrase "from the office ...", in the original recording it sounds smooth, like a single phrase (and the voice is warm and pleasant), but in the separated track it sounds like it is synthesized from pieces that are not properly connected, like vocaloid songs. It sounds a little harsher. Transitions sound unnatural. And the phrase "heya Tom" in the separated track is split into "heya" and "Tom" with some unnatural sound between (or maybe in the beginning of "Tom"). It is like transitions you can hear in vocaloid tracks. Or the kind of artifacts you get if you over-compress an MP3 file.

And "it's good to see you" part also sounds robotic.

Maybe it's losing some of the harmonics, but in a different way for different syllables, so they don't sound like a single phrase anymore.

It’s perfect relative to the most wildly optimistic expectations anyone could have reasonably held beforehand.

The vocals don’t have any significant residual artefacts from drum hits or any residual bleed-through of instruments playing the same notes. It’s magical.

Maybe the source was compressed audio instead of flac/wav?

Edit: the source is an mp3, which removes audio frequencies based on perception/masking with other frequencies. It's perfectly normal that it is showing artifacts. A better source is needed.

I encoded the source in MP3 for your convenience. But the source that I fed into Spleeter was FLAC.

Edit: I uploaded the original FLAC to my web server last night, before I decided that MP3 would be more convenient:


this is like someone just flew you to the surface of Mars in an hour and your only comment is that the ride was bumpy. The demo above is mind blowing.

I basically want to run this over every Steve Gadd recording I have.

People are hard at work on denoising of audio using neural networks also. I would expect that if one trained denoisers for each type of source separated here, and passed the separated audio through them, things would get even better.

spleeter is using neural networks.

Yes, for source separation. Denoising is generally a separate task. A denoising network would take in a noisy signal for a single source and output a cleaned-up one. It would be trained on the specific source type, for example vocals.

Gotcha, thanks.

It's easy to pick up on lots of little things, but this is still extremely impressive. It's also more than good enough for people who want to practice singing/playing over the song by themselves.

I think the next step would be to train a network that can un-robotify songs and then run it on this.

> On the original recording the voice sounds soft, but after separation it sounds like it is synthesized or passed through a vocoder

funny, the original recording seemed kind of robotic to me! maybe not robotic, but like it's been filtered somehow. but that might just be my not-so-great headphones

The original lead vocal is definitely processed. You can hear that processing clearly if you apply the classic vocal removal (really, center removal) effect. (My favorite implementation of that is the Center Cut DSP for foobar2000.)
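For anyone who hasn't met the classic center removal trick mentioned above, here is a minimal sketch of the basic idea, assuming a stereo signal held as a NumPy array of shape (samples, 2). It is only the plain left-minus-right subtraction, not whatever the Center Cut DSP does internally; the function name and the toy signals are made up for illustration.

```python
import numpy as np

def remove_center(stereo: np.ndarray) -> np.ndarray:
    """Cancel anything panned dead center by subtracting right from left.

    `stereo` is assumed to be a float array of shape (num_samples, 2).
    Returns a mono signal of shape (num_samples,).
    """
    left, right = stereo[:, 0], stereo[:, 1]
    return 0.5 * (left - right)

# Toy example: a "vocal" panned center plus a "guitar" panned hard left.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
vocal = np.sin(2 * np.pi * 220.0 * t)
guitar = 0.5 * np.sin(2 * np.pi * 330.0 * t)
mix = np.stack([vocal + guitar, vocal], axis=1)

# The vocal is identical in both channels, so it cancels; the guitar survives.
side = remove_center(mix)
```

Anything that differs between the channels (stereo reverb or doubling on the vocal, for instance) survives the subtraction, which is why this effect makes vocal processing easy to hear.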

So Jonathan Coulton is now the new Suzanne Vega?

> So Jonathan Coulton is now the new Suzanne Vega?

Bravo. For people who didn't get the sublime reference, Suzanne Vega's song Tom's Diner was a benchmark test during development of the MP3.[1]

[1]: https://en.wikipedia.org/wiki/Tom%27s_Diner#The_"Mother_of_t...

Tom's Diner wasn't just a benchmark. Brandenburg listened to it obsessively, over and over, to the exclusion of a lot of other content he might have done well to pay more attention to. Which is what tends to happen when you work on audio processing code, for better or worse.

So that's why the MP3 format mangles male vocals so badly at all but the highest bitrates. Now you know the Rest Of The Story... or at least you've read it on the Internet.

This sounds really interesting. Is there a good writeup/oral history about this?

The reference is even more apt because the two of them have collaborated. She sang the lead vocal on his song "Now I Am an Arsonist" (from the album Artificial Heart).

I gave a talk at PyCon this year about DSP [1], specifically some of the complexities surrounding this. I came across a few other ML projects that claimed to do this as well, and the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly. In the git repo of this project they also explicitly state you need to train on your own data set, though you can use their models if you like. YMMV. I would love to try this out, as it's definitely a complex bit of audio engineering. That said, I loved learning everything I did preparing for my talk and need to finish up some other parts of the project to get the jukebox working... Maybe this will help :)

1. https://m.youtube.com/watch?v=fevxy-s0vo0

Seems like most music (from the 70s on at least) is recorded multi-track and the data is out there, just not accessible to anybody. If you ever watch Rick Beato videos, he takes classic songs and isolates vocal/drum/etc. tracks all the time, I'm not sure how he has access to them: https://www.youtube.com/playlist?list=PLW0NGgv1qnfzb1klL6Vw9...

But you probably don't need to bother with old recordings since there is SO MUCH music being produced via tracking software right now I feel like it should be possible to get a pretty big dataset - the difference being, of course, professional production that affects how all these things sound in the final mix.

Although... if you have enough songs with separated tracks, couldn't you just recombine tracks and adjust the settings to create a much, much broader base for training? Just a dozen songs could be shuffled around to give you a base of 10,000+ songs easily enough. That might lead to a somewhat brittle result but it would be a decent start.
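The recombination idea above can be sketched roughly like this. It is only an illustration of the arithmetic, assuming every stem is a mono NumPy array of the same length; `random_remix`, the stem names, and the gain range are hypothetical choices, not anything from the Spleeter training code.

```python
import numpy as np

def random_remix(stem_pool, rng, gain_range=(0.5, 1.5)):
    """Build one synthetic training example from a pool of isolated stems.

    `stem_pool` maps a stem name (e.g. "vocals") to a list of equal-length
    mono arrays taken from different songs. One clip per stem is picked at
    random, scaled by a random gain, and the scaled clips are summed.
    Returns (mixture, targets), where targets holds the scaled stems.
    """
    targets = {}
    for name, clips in stem_pool.items():
        clip = clips[rng.integers(len(clips))]
        gain = rng.uniform(*gain_range)
        targets[name] = gain * clip
    mixture = sum(targets.values())
    return mixture, targets

# Toy pool: noise stands in for real audio, 12 "songs" with 4 stems each.
rng = np.random.default_rng(0)
pool = {
    name: [rng.standard_normal(1000) for _ in range(12)]
    for name in ("vocals", "drums", "bass", "other")
}
mixture, targets = random_remix(pool, rng)

# Distinct stem combinations available before any gain changes.
num_combinations = 12 ** 4
```

With 12 songs and 4 stems there are 12^4 = 20,736 combinations, which is where a "10,000+" figure could come from. Stems pulled from different songs will usually clash in key and tempo, so the brittleness worry in the comment seems fair.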

Rick says in one of his videos that he and some of his buddies have got old copies of the original source (separated) tracks, and they kind of pass them around between each other.

I find that pretty amazing given the litigiousness of some in the music industry, but there we are.

Side note: I discovered Rick Beato a few months ago and I've watched heaps of his videos. It's really fascinating hearing old classics torn down to their constituent parts. Here's one of my favourites of his: https://www.youtube.com/watch?v=ynFNt4tgBJ0 (Boston - More than a feeling).

Rick Beato is excellent. Nahre Sol and Adam Neely also do great analyses of things. Adam in a more theory oriented way and Nahre in a more feeling and composition focussed way; "Funk as digested by a classical musician" for example looks at funk to try and find the key structures of the style which illuminates things I might not have noticed otherwise.

Also 8 bit music theory has very solid video essays on varying compositional concepts that are reflected using game music. I actually find his work most consistently satisfying. Neely and Beato are great but lower s/n ratio. Nahre not enough watches to say but thumbs up for her, too.

Don't forget JazzDuets's channel. His content seems to be the most mature and uses actual playing a lot to tune your ear. I find him actually a bit too advanced for my level, but I really like his very humble and friendly personal touch.

Isn't that litigiousness mainly around money? Another important currency in the music world is respect. E.g., look at how Chamillionaire talks about Weird Al: http://yankovic.org/blog/2006/09/13/high-praise-from-chamill...

Given that, I expect that a show titled "What Makes This Song Great" will do fine. Who doesn't love having somebody note the non-obviously good parts of their work. Especially if, as with Weird Al, proper royalties are paid.

The artists and performers are often quite reasonable. When they sign a major label deal, many sign away the rights to and control of their work in an effort to make a living and support their families. They need the money, also maybe a gold record.

Once they sign, the RIAA and label lawyers get to work, so the creator may not have any influence or own the masters.

Artists have a good chance of getting the point that authentic publicity is gonna garner authentic fans with authentic ticket stubs, but in the contract, on page 147 section 14a, under "Rights and Royalties" states ...

OMG yes. I watch very little YouTube, but reading these comments I thought "this Rick guy is probably that one I saw a couple months ago, his separated Boston tune was really amazing." And there it is.

Thanks for the Rick Beato mention. Just spent a couple hours watching some of his breakdowns. Fascinating stuff and reminded me how much I like this type of analysis.

I've personally collected thousands of multitracks, stems and remix kits.

Sources are e.g. multitracks that someone leaked (like original unmixed Madonna sessions), constructed MOGG files from various Rock Band games, stems prepared for remixers etc.

I'm wondering if you could even use this to separate unrelated pieces of audio? E.g. instrumental music and someone reading a book out loud. And if you could use this to generate useful training data.

I'm sure you've thought of this, but could/have the tracks from the Rock Band games be used for training?

There are thousands of them and they're separated into different instrument tracks. They even had bands re-record songs sometimes where separate masters couldn't be found. If I recall correctly, Third Eye Blind did this for Semi-Charmed Life.

To add, there is a music format called Stems from Native Instruments, designed for DJs and live remixers, which is a disassembly of the song into its various parts.


FYI The term "stems" to refer to the individual tracks of a piece of recorded music is a lot older than NI's format. I love NI, but I'm annoyed that they chose to appropriate the industry standard term as a proprietary product.


They published it as an open standard. Many apps and stores support it. The purpose was to take an informal industry practice and make it into a formal portable open standard.


This is better than I originally thought, but it's still a bit confusing. The Stems file spec (available via registration) is basically an MP4 container with some JSON metadata. This seems to have the usual downsides of MP4 patents, but it's actually about as good as any standard a pro audio software company has released.

Ideally, I'd have liked to have seen a completely open audio codec used for both encoding and container, but MP4 is a pretty safe bet for compatibility, and it's not really NI's fault that it has some patent issues.

All in all, I could pedantically argue the "open" status, but I'll instead give credit where it's due, and give kudos to NI for releasing a pretty damn usable file format.

I'm even happy that it's limited to 4 parts. For the purposes of live performance with DJ style gear, this is plenty. If a performer wants more parts then they're probably going to be creating some or all of those parts. Either way, they'll probably be using something more like Ableton rather than Traktor.

idk anyone who uses this as a proprietary format. "stems" is an industry standard term.

As discussed elsewhere in this thread, it's not as I suggested, a proprietary format. It's still a format created by NI which appropriates the industry standard name. For a list of parties using NI's implementation of the file format see https://www.stems-music.com/stems-partners/ .

I'm happier that it's a (mostly) open standard, but I'm still slightly annoyed at the confusion that comes from NI appropriating the industry term. It's like if I released a non-text representation of storing data using a particular subset of technology that was standardized, and then called it "The Binary" format. Technically nothing wrong with it, but it's bound to cause confusion!

The SNES is a 1990s game console. Its music is generally synthesized by the SPC700 chip, from individual instruments stored in 64 kilobytes of RAM (so the instruments often sound synthetic and muffled). The advantage is that it's possible to separate out instruments.


- Programmatically gather a list of all samples used in the song

- Generate many modified .spc files, each of which mutes 1 sample via editing the BRR data.


- Use a modified SPC700 emulator which you can tell to skip playing a specific sample ID.

Record the original song to .wav. And for each sample, record "the song with one sample muted", and take (original song - 1 sample muted), to isolate that 1 sample. If the result is not silent, you have isolated 1 instrument from the original song.

The results may not always be perfect, and will need manual labeling of instruments, or manually merging together multiple piano instruments. But I think this process will work.
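The subtraction step described above could look something like the following. This is a minimal sketch, assuming both renders are sample-aligned float arrays produced from the same .spc with identical emulator settings; `isolate_sample` and the silence threshold are hypothetical, and rendering the .spc itself is left out.

```python
import numpy as np

def isolate_sample(full_render, muted_render, silence_threshold=1e-4):
    """Return what one muted sample contributed, or None if it was silent.

    full_render:  the song rendered normally.
    muted_render: the same song rendered with one sample ID muted.
    The difference between the two is whatever that sample added to the mix.
    """
    if full_render.shape != muted_render.shape:
        raise ValueError("renders must have the same length and channel count")
    difference = full_render - muted_render
    if np.max(np.abs(difference)) < silence_threshold:
        return None  # the muted sample never played in this song
    return difference

# Toy stand-ins for two instruments in a rendered song.
t = np.linspace(0.0, 1.0, 32000, endpoint=False)
lead = 0.4 * np.sin(2 * np.pi * 440.0 * t)
bass = 0.3 * np.sin(2 * np.pi * 55.0 * t)

full = lead + bass
without_lead = bass
isolated = isolate_sample(full, without_lead)
```

The subtraction is only exact while the mixing stays linear. Any nonlinear step in the render (clipping, for instance) would leave residue in the difference, which may be one reason the results "may not always be perfect".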

I'd guess this would result in a model for separating SNES music.

I would guess Garageband tracks would be more representative of 'real' instruments than chiptunes.

BTW, in the play-along mode in GB where you get pre-recorded accompaniment tracks, you can replace the drummer's kit with a drum machine and hang some filters onto it. Much fun is to be had.

However, this reminds me that filters probably make things much harder for the separation model, with the explosion of possible sounds from an instrument or voice. (Vishudha Kali's music is a nice illustration of that.)

I did come across the person who did a similar project (automating instruments based on previously recorded music), however in one project that was playing live instruments from an NES, the signals were already separated. That said, I'm not following the context of your response to my post.

You mentioned that "the biggest hold up is getting enough properly trained data, tagged appropriately, in order to let the models train correctly." I think using SNES music as training data is a viable way of getting hundreds of songs' worth of training data in a fairly automated fashion. (I'd estimate that each game has 10 to 80 songs which can be used for training, I have 5 to 10 games of OSTs already downloaded, and each song is only 64 kilobytes and takes minimal disk space before rendering to WAV.)

This is very timely. I've been working for about 3 months now on a utility that transforms mp3's to midi files. It's a hard problem and even though I'm making steady progress the end is nowhere in sight. This will give me something to benchmark against with for instance voice accompanied by piano. Thank you for making/posting this.

For an idea how this project is coming along:


Yes, it's terrible :) This particular file is the result of the following transformations:

midi file -> wav file (fluidsynth)

wav file -> midi file (my utility)

midi file -> wav file (fluidsynth once more)

wav file -> mp3 file (using lame)

Of course it also works for regular midi files (piano only for now). The reason why I use the workflow above is that it gives me a good idea how well the program works by comparing the original midi file with the output one.

But I did not yet have a way to deal with piano/voice which is a very common combination so this might really help me.

Possible applications: automatic music transcription, tutoring, giving regular pianos a midi 'out' port, using a regular piano as an arranger keyboard, instrument transformation and many others.

Having fun!

Edit: I've done a little write-up: https://jacquesmattheij.com/mp3-to-midi/

Just FYI in case you weren't aware - Ableton Live and several other DAWs have this capability built in. It's far from perfect, but great for humming a melody and then quickly turning it into MIDI.

There’s a pretty cool library from Google's Magenta team that does piano transcription pretty well. https://magenta.tensorflow.org/onsets-frames

They say it’s only really good for piano, but I definitely use it for all kinds of samples. Great for inspiration

So, I used it to run the same toccata test, here are the results:

The Magenta code:

f_measure 71.56 precision 65.75 recall 78.49 accuracy 55.72

My little batch of code:

f_measure 77.74 precision 93.40 recall 66.57 accuracy 63.58

Interpreting the results is tricky. Magenta is clearly better on recall, but that comes at the expense of much lower precision, which is why my code ends up with the better f_measure and accuracy; it is also nicer to listen to because there are far fewer spurious notes.

My code also runs about 100 times as fast and uses very little in terms of resources. So, rather than being depressed it looks like I'm on to something :)
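As a sanity check on the two result lines above, the f_measure and accuracy columns can be recomputed from precision and recall alone. This sketch assumes f_measure is the usual harmonic mean and that accuracy is defined as TP / (TP + FP + FN); the second definition is a guess at what the evaluation tool reports, and the dictionary labels are mine.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def accuracy_from_pr(precision, recall):
    """Accuracy under the assumed definition TP / (TP + FP + FN).

    Since 1/precision = (TP + FP) / TP and 1/recall = (TP + FN) / TP,
    it follows that 1/accuracy = 1/precision + 1/recall - 1.
    """
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)

# Percentages exactly as reported in the comment above.
reported = {
    "magenta": {"precision": 65.75, "recall": 78.49,
                "f_measure": 71.56, "accuracy": 55.72},
    "mine": {"precision": 93.40, "recall": 66.57,
             "f_measure": 77.74, "accuracy": 63.58},
}

recomputed = {}
for name, row in reported.items():
    p, r = row["precision"] / 100, row["recall"] / 100
    recomputed[name] = {
        "f_measure": 100 * f_measure(p, r),
        "accuracy": 100 * accuracy_from_pr(p, r),
    }
    print(name, recomputed[name])
```

Both rows come out within rounding of the reported figures, so the four numbers on each line are consistent with one another under those definitions.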

Oh wow that's extremely promising! Yeah the magenta thing destroys my browser when I run it. Still feels like magic though haha. I would be extremely interested in some other options so good luck!

Thank you! If you have any files you want me to test with then feel free to send them, email is in my profile.

Oh cool, thank you for that, I did not know about this yet. That may come in very handy.

How good is it (% accuracy) for polyphony? I can upload the toccata original if you want.

How do you do this in Ableton?


There are four options.

> Convert Melody to New MIDI Track

> This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.

Neat. So the big difference then is that I do full polyphony, but I'm still limited to 'just' piano, and that's already hard enough for now.

Thanks so much. I thought I had read through the manual but there's all kinds of stuff I've missed.

messed around with the 2stem model for a bit and it's reasonably good. I think phonicmind is still a bit better - phonicmind tends to err on the side of keeping too much, while the 2stem model tries to isolate aggressively and often damages the vocal as a result (distorting words by losing some harmonics, or losing quiet words entirely)


https://files.catbox.moe/wjruiv.mp3 (phonicmind)

https://files.catbox.moe/uuzot3.mp3 (spleeter 2stem)

you can hear spleeter does better at actually taking out the bass drums, but phonicmind never loses or distorts any part of the vocal, while 2stem occasionally sounds like the singing is coming through a metal tube (harmonics are missing). will try to read instructions more carefully and see if there's some way to fix.

For those who, like me, hadn’t heard of PhonicMind before, it’s an online service at https://phonicmind.com/ that charges between $1.50 and $4 per song to separate out vocals, drums, bass, and the rest of the sounds. You can upload any audio file to that website and get a 30-second preview of separated parts for it.

An interesting alternative approach for instrument sound separation is to use a fused audio + video model. So, given that you also have video of the instruments being played, you can perform this separation with higher fidelity.

I was fascinated by the work done by “The Sound of Pixels” project at MIT.


That’s quite clever but not really practical : instruments heard in most music produced today aren’t "played" by humans.

Gave this a go, it's an easy install with pip, and results are pretty quick even on an old macbook. Splitting into 2stems (vocals/accompaniment) on some random songs I chose worked quite well using the pretrained models provided. Of course, ripping the vocals out of the accompaniment takes out a good chunk of the middle frequencies so some songs sound a bit wonky. Worth a play if you are interested.

Same thoughts here. I ran Thriller, Alligator by Of Monsters and Men, and In Hell I'll be in Good Company by The Dead South on the 2 / 5 / 4 stems, respectively. Impressive results. Definitely agree that some of the middle frequencies show some error.

It would be really cool to create "music mappers"/life sounds tracks like what you can do with pictures & art styles (e.g. https://medium.com/tensorflow/neural-style-transfer-creating...)

knowing nothing about the results, i suspect that mid-ranges are poorer mainly because human frequency response is most sensitive towards mid-range aka vocal-pitch frequency

It's really good on the 2-stem stuff. On the 4-stem model, it's a bit shy about the bass part, and parts drift in and out. I'd like to try it on a FLAC.

Same, Rage Against The Machine - Killing In the Name came out sounding great. Very cool.

Non-open source products that also separate vocals from music if you need something more "professional".

One-click process: Xtrax Stems 2 (https://audionamix.com/technology/xtrax-stems/)

Professional: ADX Trax Pro 3 (https://audionamix.com/technology/adx-trax-pro/)

Both products use a server that hosts much larger pre-trained models. The professional one has added features such as sibilance handling, a GUI to edit note following as a guide for the models, and an editor tool for extracting using harmonics.

(Note: I don't work for this company. I do pay for / use their products, and I also happen to know someone who works there.)

I wonder how it would fare on Pink Floyd's "Sheep", where vocals seamlessly transform into instrumentals and it's impossible to tell where one ends and the other begins. https://www.youtube.com/watch?v=3-oJt_5JvV4 (skip to around 1:40)

Interesting to read Thomas Dolby's thoughts on music/technology interfaces--particularly with VR https://semiengineering.com/thomas-dolbys-very-different-vie...

I'd love to see how this compares with Celemony Melodyne. As far as I've been able to determine, Melodyne doesn't use ML, but it's hard to find out exactly what it does use.

Either way, an open source competitor to Melodyne is a welcome addition!

There is a patent for Melodyne that describes looking for harmonics vs time in FFTs, then heuristics for deciding which belong to one note and where it starts and ends, then assigning some of the residual energy (e.g. noisy onset) to each note.

That's the second time I've seen someone mention Melodyne for separating vocals from a full song source - I don't think that's something it can do? Melodyne is for tuning vocals / instruments & correcting timing on already isolated tracks.

melodyne's editing interface lets you remove different notes from a polyphonic track. so if it's just vocals + other tonal sounds, you can manually remove the other tonal sounds. example: https://youtu.be/2ZjdDatxTaQ?t=83

Hmm, never tried that with melodyne myself and the video you posted isn't a great example of an accurate vocal extraction - those are more like vocal chops and are already pretty dirty to begin with. Based on my experience with Melodyne, I'd be surprised if you could cleanly extract a plain singing vocal without tons and tons of work.

I’ve always assumed Melodyne uses FFT bins.

Is the paper, "Spleeter: A Fast And State-of-the Art Music Source Separation Tool With Pre-trained Models", available yet? What is the methodology?

they made an extended abstract for ismir: http://archives.ismir.net/ismir2019/latebreaking/000036.pdf

methodology is a separate u-net per instrument type to predict a soft mask in spectrogram space (time x frequency), then they apply that mask to the input audio. fairly standard.
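To make the "soft mask in spectrogram space" step concrete, here is a rough sketch of just the masking part. The per-source magnitude estimates would come from the U-Nets; here they are plain input arrays. The ratio-of-squared-magnitudes mask is one common choice and is an assumption on my part, as are the function and variable names; the STFT and inverse STFT around it are omitted.

```python
import numpy as np

def apply_soft_masks(mixture_stft, estimated_mags, eps=1e-10):
    """Split a complex mixture spectrogram using ratio (soft) masks.

    mixture_stft:   complex array of shape (freq_bins, frames).
    estimated_mags: dict of source name -> non-negative magnitude estimate
                    with the same shape (stand-ins for network outputs).
    Returns a dict of source name -> masked complex spectrogram.
    """
    power = {name: mag ** 2 for name, mag in estimated_mags.items()}
    total = sum(power.values()) + eps  # eps avoids division by zero
    # Each mask lies in [0, 1]; across sources the masks sum to about 1.
    return {name: (p / total) * mixture_stft for name, p in power.items()}

# Toy data standing in for a real STFT and real network predictions.
rng = np.random.default_rng(0)
shape = (513, 40)
mixture = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
estimates = {
    "vocals": rng.uniform(0.1, 1.0, shape),
    "accompaniment": rng.uniform(0.1, 1.0, shape),
}
separated = apply_soft_masks(mixture, estimates)
```

Because the masks only rescale each time-frequency bin of the mixture, every output keeps the mixture's phase. That is a plausible source of the "robotic" artifacts people describe elsewhere in the thread wherever voice and instruments overlap in frequency.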

I look forward to a day I can click a button to watch videos online without any unnecessary and distracting background music (though it would be better if there were an option and precedent to offer unornamented narrative in video players). The next step after this would be to have live 'music cancelling' headphones for the grocery store (if such a thing still exists).

Wow. Office background noise mute.

The headphones can filter out speech that isn't above a certain threshold. Coworkers nearby can be heard loud and clearly.

Music can play at volume then quiet itself when it detects a person speaking directly to you.

Maybe even a training button to inform it that it is false-positive-ing background noise, or true negative and silencing a coworker you would like to hear.

This is incredible. I made an example using David Bowie's "Changes". A bit robotic, but even the echo is still present in the vocal track. https://www.youtube.com/watch?v=KPlmrq_rAzQ

Does it work with spoken word as well? My use case: improve podcast quality by extracting the vocals only, and leaving out all background and accidental noise.

Neither free nor open source, but you can try a plugin called iZotope RX for this purpose

I wonder how this compares to Open Unmix (https://github.com/sigsep/open-unmix-pytorch), that one calls itself state-of-the-art as well and is done in collaboration with Sony from what I see of their paper.

Oh, I just found their paper, http://archives.ismir.net/ismir2019/latebreaking/000036.pdf. It's pretty competitive.

Tried it on “Halleluwah” by CAN, had to hear those drums:


Finding drum breaks in music is very time consuming. This is gonna be amazing for music production. Think how 90s jungle would’ve been if they had access to every drum take ever

Wow, this is the isolated track I never knew I needed to hear.

The extracted vocals sound great! But the resulting accompaniment tracks I've heard so far (tried on a handful of songs) aren't of usable quality for most purposes where you'd want an instrumental track – they're too sonically mangled.

Since people are often interested in doing this for a handful of specific tracks and not necessarily en masse, I'd be curious about what a human-assisted version of this could look like and whether you really could get near-perfect results...

What if you explicitly selected portions of the track you knew had vocals, so it could (1) know to leave the rest alone and (2) know what the backing track for the specific song naturally sounds like when there's no singing happening? It could try to match that sonic profile more carefully in the vocal-removed version.

Or what if you could give it even more info, and record yourself/another singing (isolated) over the track? Then it would have information about what phonemes it should expect to find and remove (and whatever effects like reverb are applied to them).

I am working on a product which makes use of this technology. I generate vocal pitch visualizations for karaoke


Cool - but your website needs some work. It looks like a landing page to gather interest rather than something backed by a real product. Show us some videos and singing, before and after, etc.

FYI your email confirmation is going straight to spam on gmail. I'd recommend reaching out to Mailchimp.

Wait the implications of this are huge for electronic music DJs

Audio Neural Transfer Learning could be amazing.

The audio (^F soundcloud) sounds a little warbly... if that can be largely mitigated, then yes, remixes will never be the same

While not great, the phase smearing is orders of magnitude better than most vocal isolation plugins I've used. I only expect it to get better. Very cool!

Is there anything like this for images? Meaning essentially trying to decompose back into photoshop layers. Wouldn't be feasible for lots of stuff that is completely opaquely covering something, but I'm thinking for things like recoloring a screen print, etc.

I have no idea how I managed to re-find these, but I did. Two recent moderately-related/relevant posts:



Ah, so this should've been the answer to my ask HN [1].


Edit: I see someone added it in as an answer 14 hours ago. Well, you have my vote ^^

[1] https://news.ycombinator.com/item?id=21399838

Played with it. The quality of the result is mostly dependent on the amount of clipping in the source file. Basically, all post-90s masters produce weird results with orcs singing in the background. And classics from 60s yield fantastic results.

Might be the training data?
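If anyone wants to test the clipping theory above on their own files, one crude way to put a number on "amount of clipping" is the fraction of samples sitting at full scale. This is only a heuristic sketch, assuming float samples normalized to [-1, 1]; the function name and threshold are arbitrary, and a heavily limited master can sound squashed without tripping it.

```python
import numpy as np

def clipped_fraction(samples, threshold=0.999):
    """Fraction of samples whose magnitude is at or above `threshold`.

    `samples` is assumed to be float audio normalized to [-1, 1].
    """
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(samples) >= threshold))

# Toy comparison: a quiet sine versus the same sine driven into hard clipping.
t = np.linspace(0.0, 1.0, 48000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 100.0 * t)
loud = np.clip(2.0 * np.sin(2 * np.pi * 100.0 * t), -1.0, 1.0)

clean_ratio = clipped_fraction(clean)
loud_ratio = clipped_fraction(loud)
```

Comparing this number across a few 60s and post-90s masters, next to how the separated stems sound, would at least show whether the correlation holds.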

This is awesome. I now have Guns'n'Roses playing in my office, and Axl Rose is a faint voice coming from the garage.

I gave it a try with Megadeth's Holy Wars[1], was expecting something like this[2] but got very deep audio. Not sure why, but perhaps it's because bassist David Ellefson uses a pick, which gives the percussive sound that suits Megadeth.

Any parameter I could use with spleeter to get a similar output?

[1] https://www.youtube.com/watch?v=9d4ui9q7eDM

[2] https://www.youtube.com/watch?v=uWkykQHsJ-Y

I'm trying to find something to generate tracks without guitar. Then I can cover them with my guitar. Will this software help me?

Depends. It splits a track into 2 parts (vocals, everything else), 4 parts (drums, bass, vocals, everything else), or 5 parts (drums, bass, vocals, piano, everything else). Piano isolation is the weakest.

If you want to play along to drum + bass, then yes.
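To make the play-along case concrete, here is a rough sketch of building a drum + bass backing track by summing two stems. The stem names follow the file names Spleeter writes as far as I know ("other" and "accompaniment" standing in for "everything else"), so treat them as an assumption; `mix_stems` and the toy sample values are made up for illustration. There is no dedicated guitar stem, so guitar would land in "other" along with everything else that is not vocals, drums, or bass.

```python
# Stem names per pretrained configuration (assumed from Spleeter's output files).
STEMS = {
    "2stems": ["vocals", "accompaniment"],
    "4stems": ["vocals", "drums", "bass", "other"],
    "5stems": ["vocals", "drums", "bass", "piano", "other"],
}


def mix_stems(stems, keep):
    """Sum the selected stems sample by sample to build a backing track."""
    selected = [stems[name] for name in keep]
    return [sum(frame) for frame in zip(*selected)]


# Toy 4-sample "stems"; real ones would be decoded from the output WAV files.
toy = {
    "vocals": [0.1, 0.2, 0.1, 0.0],
    "drums": [0.5, 0.0, 0.5, 0.0],
    "bass": [0.25, 0.25, 0.25, 0.25],
    "other": [0.0, 0.1, 0.0, 0.1],
}

backing = mix_stems(toy, keep=["drums", "bass"])
print(backing)  # [0.75, 0.25, 0.75, 0.25]
```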

Can we use sample libraries to write, record and simulate desired stems for training? I guess the more naturally played the better?

Not only are the results good, but the music is generated decently rapidly. The implications are clear: whoever wants to make a quick fortune on YouTube should start converting and uploading truckloads of songs as fast as possible. The demand is there. I could easily see that bringing in millions of views.

They’d still get tagged for copyright.

On iOS, Chord AI [1] gives pretty good results for the guitar chords of any music surrounding the phone.

[1]: https://apps.apple.com/us/app/chord-ai/id1446177109

That sounds pretty nice! Does anyone know of an Android version? I just checked Yamaha Chord Tracker and MyChord, but neither seems to be able to use the microphone.

Can someone provide a demo link of source music vs. output?

I gave it a test using the project audio sample. Neat stuff. https://soundcloud.com/thomas-roderick-836298141/sets/spleet...

Holy shit that works way better than I expected. The github project should link to this or a similar example, the technical description doesn't do it justice.

Yeah the fact that it got the reverb in the vocal track is pretty impressive!

I really disagree. It sounds... awful. On par with other approaches, sure, but the main vocals sound like a case study in digital artifacts and the accompaniment sounds like there's a filter automated over the track.

Far from useful.

Do you know of something superior?

> Spleeter is the Deezer source separation library with pretrained models

Curious, what is Deezer using this for?

On-demand karaoke parties right inside Deezer?

Gonna guess beat/mood/song analysis.

Yeah but they can use the raw song for that I suppose.

Easier way to match voices?

This is so neat! I went looking a few months back for something like this, and the best I found was Google's Magenta.

It would be really cool to use this to feed into Magenta. Think of the mashups!

Karaoke with the most obscure songs!

And even better, karaoke that doesn't suck--most karaoke tracks are covers by cheap bands and you can clearly tell the inferior quality if you're familiar with the song.

Sure, not to mention the awful singing accompanying most karaoke tracks.

Very cool. A close friend of mine (and lead singer in our band many years ago) recently died, and we have a couple of great recordings of his vocals from two decades ago. When we recorded the rest of the instruments, they were DI'd into a Boss BR8. The vocals sound awesome, but the guitar and drums are recorded poorly. This may give us a chance to split the vocals out of the final tracks and re-record the instruments as a tribute.

Much appreciated.

How could we extract anything but the voice e.g. karaoke?

The repo's quick start instructions [0] show how to use it with the "2-stems" model [1], which separates the source audio into two files, outputdir/source/vocals.wav and outputdir/source/accompaniment.wav:

    $ spleeter separate -i spleeter/source.mp3 \
         -p spleeter:2stems \
         -o outputdir

[0] https://github.com/deezer/spleeter#quick-start

[1] https://github.com/deezer/spleeter/wiki/2.-Getting-started#u...
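As a small illustration of where the files end up, here is a sketch that derives the expected output paths from the input file and the -o directory, following the layout described above (one subfolder named after the source file, one WAV per stem). The helper name `expected_outputs` is hypothetical; it only mirrors the naming pattern and does not call Spleeter.

```python
from pathlib import PurePosixPath


def expected_outputs(source, output_dir, stems=("vocals", "accompaniment")):
    """Build the paths the 2-stems run above is described as producing."""
    # The subfolder is named after the input file without its extension.
    base = PurePosixPath(output_dir) / PurePosixPath(source).stem
    return [str(base / f"{name}.wav") for name in stems]


paths = expected_outputs("spleeter/source.mp3", "outputdir")
print(paths)
# ['outputdir/source/vocals.wav', 'outputdir/source/accompaniment.wav']
```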

I'd guess extract the voice and then subtract it from the rest with something like Audacity. I'm not sure which operation would do that, but I believe that it exists.

Also, other comments here speak of “separating into the voice and accompaniment,” so maybe the model/program already does exactly what you need.

Invert and mix.
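In case the trick is not obvious: inverting a track flips the sign of every sample, and mixing adds samples together, so mixing the full song with an inverted vocal track cancels the vocals. A minimal sketch with toy sample values (the function names are illustrative; in Audacity this would be the Invert effect followed by mixing the tracks):

```python
def invert(samples):
    """Flip the polarity of every sample."""
    return [-s for s in samples]


def mix(a, b):
    """Sum two equally long tracks sample by sample."""
    return [x + y for x, y in zip(a, b)]


vocals = [0.25, -0.5, 0.125, 0.0]
backing = [0.5, 0.25, -0.25, 0.75]
full_mix = mix(vocals, backing)

# Mixing the song with the inverted vocals leaves only the backing.
recovered = mix(full_mix, invert(vocals))
print(recovered)  # [0.5, 0.25, -0.25, 0.75]
```

The cancellation is only exact when the vocal track is sample-accurate; with an imperfect extracted stem, whatever the stem got wrong remains in the result.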

When you have an instrumental version of a song (from the same stems as the vocal version) this is already one way to get the vocals out without any fancy machine learning. The main tricks besides what you can do in Audacity like that are properly time-aligning the tracks (even if they drift a bit) and compensating for phase issues and compression. I wrote a dirty tool that does that and I've been meaning to turn it into some kind of nicer GUI version.
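This is not the tool mentioned above, just a rough sketch of the time-alignment step under a strong simplifying assumption: the two tracks differ by a constant offset of a few samples. It brute-forces the lag that maximizes the cross-correlation. Drift, phase issues, and compression, which the comment says also matter, are not handled here.

```python
def best_lag(reference, delayed, max_lag):
    """Return the shift (in samples) of `delayed` relative to `reference`
    that maximizes their cross-correlation, searching -max_lag..max_lag."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]
        if score > best_score:
            best, best_score = lag, score
    return best


instrumental = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
# The same signal arriving two samples late.
late_copy = [0.0, 0.0] + instrumental[:-2]

lag = best_lag(instrumental, late_copy, max_lag=4)
print(lag)  # 2

# Shift the late copy back so the two tracks line up again.
aligned = late_copy[lag:] + [0.0] * lag
print(aligned == instrumental)  # True
```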

I've been doing something like this for a bit in Audition. Center channel extract > invert phase > save as wav > create multi-track project > add original > add modified > up the volume on the vocal extracted modified version until the vocals go away

Or you can do the exact opposite: instead of extracting the center channel to get the vocals, remove the vocals and use this method to isolate them better.

Although if things do fun stuff with stereo it might not be exact.
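The simplest version of the center-channel idea can be sketched as subtracting one stereo channel from the other: anything identical in both channels (typically the lead vocal) cancels, and only material that differs between left and right remains. This is the basic time-domain trick, not Audition's actual effect, and the toy signals are made up for illustration.

```python
def remove_center(left, right):
    """Side signal: cancels anything that is identical in both channels."""
    return [(l - r) / 2 for l, r in zip(left, right)]


vocal = [0.5, -0.25, 0.125]        # panned dead center
guitar_left = [0.25, 0.25, -0.5]   # present only in the left channel

left = [v + g for v, g in zip(vocal, guitar_left)]
right = list(vocal)

side = remove_center(left, right)
print(side)  # [0.125, 0.125, -0.25]
```

The vocal is gone and the guitar survives at half level. It also shows the caveat above: stereo reverb or widening on the vocal differs between channels, so it would not cancel.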

If you don't mind sharing (even if it's cmd line), I'd love to explore.

I'm pretty sure it does that by default.

Has anyone tested this with Glenn Gould recordings?

Hehe, you want to split his singing and humming into a separate track?

I have a large number of multitrack recordings of contra dances if anyone wants to try training this on them.

You train this on them! (And then put the results on SoundCloud or YouTube.)

Are there any examples I can listen to?

I set up a Colab notebook to try spleeter out for myself. You can try picking your favorite mp3, renaming it to "audio_sample.mp3", uploading it to the Colab, and running all the cells in the notebook. Enjoy.


This is great. Thank you! I extended the notebook with an example of downloading a youtube video, extracting the audio, then feeding it through spleeter.

Notebook: https://colab.research.google.com/gist/shawwn/0f286f5d4bc22e...

Cool! That's so handy.

I gave it a quick test using their audio sample file:


I was just testing it out on Gazal, and it seems to work perfectly. But it seems to fail with Qawwalis. Given my understanding of how this works, is it safe to assume that the training data from Deezer lacks enough examples of Qawwalis?

Aren't all the most popular audio formats lossy? Extracting full data from lossy compression requires reconstruction. Even if they are able to completely extract all tracks they would have gaps and be very low quality.

If you want lossless audio in a decently popular format then just go looking for FLAC, you'll definitely find some. Even bandcamp uses it [1].

[1] - https://en.wikipedia.org/wiki/FLAC#Adoption_and_implementati...

Just tried it on _Meet The Sniper_. Disappointment :-(

[This was an unreasonable test, and it did really well considering the likely training set. I bet it could do much better with better data. Still... man, I was so hoping for magic.]

Has someone in the Intelligence community approached the author? Oops that's classified. :)

This has implications for extracting sounds from noisy recordings or am I off base? Does it only track pitch patterns?

Source separation of speech and speech denoising is a well established field, more researched in general than music source separation. Intelligence officers very likely have access to a range of well-performing ML tools for extracting speech.

Is there a known approach that attempts to separate all distinct sounds (timbres rather than pitch) in a track? Specifically targeted at electronic music, not standard acoustic ensembles.

That's an interesting idea. Many instruments have greatly varying timbres, though. Combining the timbres back to instruments would require another level of processing.

Somewhat of a tangent, but does anyone have a recommendation of an open source (ideally python) program that can make MIDI from piano audio?

I've been working on this for the last 3 months.

Can you send me your audio file? I'll send you back the midi I can generate; it won't be perfect but it might be usable. See comment elsewhere in this thread.

Awesome! Is your project open source? Can't wait to see it.

Here's the first (but by no means only) file I have in mind:


Stone Eyes from the Final Fantasy VII tribute album Voices of the Lifestream.

Here's an open-source project from Google's Magenta project that does exactly what you want: https://magenta.tensorflow.org/onsets-frames

There is probably something out there, but I know you can do this in Ableton Live by dragging an audio file onto a MIDI track and it will extract the notes into MIDI for you.

Automatic music transcription is the technical/academic name of this task. Maybe that can help you in your search?

MIDI Guitar 2 from Jam Origin (jamorigin.com) works well even for piano.

Melodyne or Ableton. I've found Melodyne to be more accurate, but still not perfect.

They asked for open source

Well, can it extract the bass track from "...And Justice for All"?

Can this also separate the backing vocals from the lead singer?

Thank you!! This is really amazing :)

demo links would be helpful

Karaoke everything!

This is amazing - so much possible learning for aspiring producers.

Cool stuff

PyTorch > TensorFlow
