Show HN: Free AI-based music demixing in the browser (sevag.xyz)
190 points by sevagh on July 13, 2023 | 45 comments
Hi all,

I've spent some time working on music demixing or music source separation algorithms, which take in a mixed song and output estimates of isolated components (e.g. vocals, drums, bass, other).

I took a popular PyTorch model with good performance (Open-Unmix, UMX-L weights), reimplemented the inference steps in C++, and compiled it to WebAssembly for a free client-side music demixer.




I tried to use it, but I ran into some issues like others in this thread.

I have tried many sources and methods over the years and settled on Spleeter [0]. It works well even for 10+ minute songs, across styles varying from flamenco to heavy metal.

[0] https://github.com/deezer/spleeter


You can always run higher-quality models than Spleeter, for example Demucs or the real Open-Unmix UMX-L, in their full PyTorch glory, as long as you have a computer that can handle it.

I'm somewhat limited in what can run in a browser by the nature of the project.


Thanks for the suggestion. Good luck with your project.


wait, deezer? I wonder why they made something like this


Cool.

What about the ability to separate out background noise? I'm thinking of a project like the one where the Beatles live album had the screaming fans mixed down:

https://www.wired.com/2017/03/remastering-one-beatles-live-a...

"It doesn't exist as a software program that is easy to use," Clarke says. "It's a lot of Matlab, more like a research tool. There's no graphical front end where you can just load a piece of audio up, paint a track, and extract the audio. I write manual scripts, which I then put into the engine to process."

There are tons of recordings of live performances that could use a little AI TLC.


I've thought about this problem before (but I have so many projects going on that I axed it).

Hypothetically, this is how I would approach it.

I would start by forking Open-Unmix or another 4-stem model (Demucs, MDXNet). The code is oriented towards the 4 sources (vocals/drums/bass/other), a split that dominates because the major training datasets use these stems.

The Open-Unmix training code goes like:

```
x, y1, y2, y3, y4 = load_training_data()

y1_est, y2_est, y3_est, y4_est = unmix(x)

loss = loss([y1, y1_est, y2, y2_est, y3, y3_est, y4, y4_est])
```

In your case it may be simpler, something like `noisy mix = clean mix + background noise`.

That way, I don't think you're even constrained to using a dataset that has 4 stems available (which is a rare quality in a dataset).

Instead, you need a way to acquire or generate screaming and other concert noises.

Then, the new training code could look like:

```
x = load_training_data()

n_samples = x.shape[-1]  # get length of music waveform

noise = generate_screams(n_samples)

x_noisy = x + noise

x_est = unmix(x_noisy)

loss = loss(x, x_est)
```

Well, anyway, that's my naive first idea of how I would approach it. But this relies on having clean background noise/screams _without any music in them_, and I'm not aware of such datasets.
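To make that concrete, here is a minimal runnable PyTorch sketch of the same training step, with a toy model and random tensors standing in for the real network, dataset, and scream generator:

```
import torch
import torch.nn as nn

# toy stand-in for a single-target separation model (a real one would be a
# fork of Open-Unmix/Demucs operating on spectrograms or waveforms)
model = nn.Sequential(
    nn.Conv1d(2, 16, 5, padding=2), nn.ReLU(), nn.Conv1d(16, 2, 5, padding=2)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for step in range(10):
    x = torch.randn(4, 2, 44100)       # pretend batch of clean stereo music, 1s at 44.1 kHz
    noise = 0.1 * torch.randn_like(x)  # stand-in for generate_screams(): crowd noise, no music
    x_noisy = x + noise                # synthesize the "live bootleg" mix
    x_est = model(x_noisy)             # estimate the clean music from the noisy mix
    loss = criterion(x_est, x)         # loss(x, x_est) from the pseudocode above

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```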


Would love to see something like this auto-mix song transitions in playlists to be more DJ-like.


I started down this road, and have amassed a small collection of files made up of just the first ten and last ten seconds of demixed tracks of every song in my library

I hoped to put together a system that compares the backs to the fronts and lists the output to find cool transitions, but I have no idea how to actually "grade" the similarities

Beyond basic BPM matching on the drum tracks, nothing I've tried has made for anything really compelling (sounds random... :( )


OK, so, tangentially related: I tried to do something once - I took small chunks of songs generated by SampleRNN in an attempt to stitch together the ones that sounded the most similar to create a much longer track.

The script [1] uses Essentia Chromaprint [2] to "grade" the similarity of audio tracks, and combine the ones with the closest chromaprint. No crossfade or BPM matching, just yolo concatenation.

I have a track on Soundcloud which uses the above technique (mashing together short generated clips by their chromagram), trained on Cannibal Corpse [3]

1: https://github.com/sevagh/1000sharks.xyz/blob/master/sampler...

2: https://essentia.upf.edu/reference/std_Chromaprinter.html

3: https://soundcloud.com/user-167126026/1000sharks-domainal-sk...
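If it helps, the rough shape of the idea is something like this sketch, which uses librosa chroma features and cosine similarity as a simplified stand-in for the Essentia Chromaprinter comparison in the actual script:

```
import librosa
import numpy as np

def chroma_signature(path):
    # mean chromagram over the clip; a rough stand-in for a chromaprint fingerprint
    y, sr = librosa.load(path, sr=22050, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape (12, n_frames)
    return chroma.mean(axis=1)

def similarity(path_a, path_b):
    # cosine similarity between the two mean chroma vectors
    a, b = chroma_signature(path_a), chroma_signature(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```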


Interesting, I attempted to do the same as you but stopped just shy of BPM matching.

However, I did get sound similarity working using an audio tagging neural net [1]. I chopped out the first and last 15 seconds of every song in my collection and ran them all through this analysis, which produces a ~520-dimensional vector. I then targeted specific endings I wanted to match and used Euclidean distance to find the closest matching song beginning.

YMMV, but I thought it actually worked pretty well; I just never got around to automating the BPM matching. I can try to look for my old script if you're interested :)

[1] https://github.com/fschmid56/EfficientAT
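The distance part is roughly like the sketch below, where embed(), ending_clips, and beginning_clips are placeholders for a wrapper around the tagging net and the two sets of 15-second clips:

```
import numpy as np

# embed() is a placeholder for a wrapper around the audio tagging net in [1],
# returning a ~520-dimensional vector for a 15-second clip
endings = {p: embed(p) for p in ending_clips}        # last 15 seconds of each song
beginnings = {p: embed(p) for p in beginning_clips}  # first 15 seconds of each song

def best_transitions(ending_path, k=5):
    # rank candidate song openings by Euclidean distance to this ending
    e = endings[ending_path]
    dists = {p: np.linalg.norm(e - b) for p, b in beginnings.items()}
    return sorted(dists, key=dists.get)[:k]
```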


My idea was to let an existing app like djay do the beat matching (which it does really well!) and demixing (less well), and then make a custom app that would act like a MIDI controller, adjusting the vocals, beats, and melody while songs are transitioning, perhaps working off of cues bookmarked in the songs.


A few years from now, DJs will have buttons on their pads to extract voices, beats, melodies, etc. in realtime. It could result in an interesting new style of club music.


DJs have this now; the track still needs to be processed, but that can happen in advance or on the fly (VirtualDJ + a GPU takes about 10s to process a track and can do so seamlessly while the track is playing).


This already exists in the djay app (paid add-on feature). It even runs on a smartphone! It performs about the same as the app this thread is about.


> re: "realtime"

it separates the components of tracks that it can download (and process), not of a live audio feed


The AI could really just:

- select the best next song given a simple input (for example, a microphone or a camera looking into the crowd)

- mix it into the current song

- repeat


Well, the naive way to do it would be to detect the onset of each beat by assuming it corresponds to kick drums on the fours, then speed up or slow down the track you're mixing in so the fours line up. This would be very easy to implement, and over a longish crossfade (say a couple of bars) it would be fairly convincing.
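A rough librosa sketch of that naive approach (tempo estimation, time-stretching the incoming track, and a linear crossfade over ~2 bars; it matches tempo but does not align the beat phase, which you would still need to do):

```
import librosa
import numpy as np

def naive_transition(path_current, path_next):
    y_cur, sr = librosa.load(path_current, mono=True)
    y_next, _ = librosa.load(path_next, sr=sr, mono=True)

    tempo_cur, _ = librosa.beat.beat_track(y=y_cur, sr=sr)
    tempo_next, _ = librosa.beat.beat_track(y=y_next, sr=sr)

    # stretch the incoming track so its tempo matches the outgoing one
    y_next = librosa.effects.time_stretch(y_next, rate=float(tempo_cur) / float(tempo_next))

    # linear crossfade over ~2 bars (8 beats) at the matched tempo
    fade_len = int(8 * (60.0 / float(tempo_cur)) * sr)
    fade = np.linspace(0.0, 1.0, fade_len)
    overlap = y_cur[-fade_len:] * (1.0 - fade) + y_next[:fade_len] * fade
    return np.concatenate([y_cur[:-fade_len], overlap, y_next[fade_len:]]), sr
```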

Using AI you could probably intuit the "pulse" of the music to find the downbeat of each bar, and sync tracks up so that they mix in a place that makes musical sense.


Didn't work on Chrome or Brave or Firefox for me.

Console warning:

The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page. https://goo.gl/7K7WLu

Download Weights button also did nothing.


I see that error, but in my own testing it never affected the outputs. However, Download Weights should be working. It's not exactly an interactive progress bar; it just waits until all 45 MB are downloaded before setting the progress bar to 100%.


Retry. It'll work. (Brave tested)


I like your site. WebAssembly has opened up client-side ML inference and it's a really convenient way for devs to serve certain models now. Out of curiosity, why did you reimplement inference in C++ to compile to WASM instead of using a Python-to-WASM solution?


Good question! So, I wasn't even thinking about WASM to begin with. When I saw llama.cpp and whisper.cpp on the front page of HN, I found the idea exciting: instead of neural networks being magic, I wanted to copy the ggml idea of parsing the PyTorch weights file myself and rewriting the inference code in a lower-level language than Python (or, more accurately, than PyTorch, since so much of the matrix heavy lifting, e.g. broadcasting and reshaping, is done for you automatically).

That's when I wrote umx.cpp [1] (which is what this site is based on).

On an unrelated project, a friend of mine mentioned WASM, and as I looked into it a bit more I thought trying to compile umx.cpp to WASM would be a great idea, since I only use Eigen (which is a header-only library that only depends on std).

1: https://github.com/sevagh/umx.cpp


I met a DJ a few weeks ago who had vocals, drums, and bass all on knobs and could do this on the fly in realtime. I bet it works in a similar way under the hood.

I believe the program he was using was djay Pro AI [0].

[0] https://www.algoriddim.com/djay-pro-mac


The feature is called "stems" and it's also available in most major DJ apps like Serato and Rekordbox.


Anyone know if this kind of real-time filtering is what is done with Apple Music’s karaoke feature?


Nice job. I love this trend of fully client-side WASM utilities.


Haven’t tried this demo, but in my experience these open-source models that split music into four components work absolutely fantastically. Not quite perfectly (if you remove vocals, the remaining track may have faint echoes of them), but astoundingly well compared to the state of the art just 5 years ago or so.

However… what if you want more than four components? What if you want to split a complex arrangement into a separate component for each individual instrument? Does anyone know of any interesting research in this area?


Demucs [1], one of the leading/SOTA systems, has an experimental 6-source model, `htdemucs_6s`, which adds piano and guitar:

>We are also releasing an experimental 6 sources model, that adds a guitar and piano source. Quick testing seems to show okay quality for guitar, but a lot of bleeding and artifacts for the piano source.

I believe Audioshake [2] (a company in the space) is doing guitar separation as well.

1: https://github.com/facebookresearch/demucs

2: https://www.audioshake.ai/


The primary challenge is the availability of sufficiently large datasets with many separate stems/tracks. Those who find a way to build them will be able to churn out such a model using existing architectures.

The companies doing AI mixing have a huge advantage in this area.


Gave it a shot - as the webpage warns, it does take a bit of time. The results are indeed impressive, although the separation isn't perfect: each division (drums/bass/vocals) has slight echoes of the others (the bass line occasionally includes a slightly muffled vocal, the vocals include some snare, hi-hat, and strings). It's a great starting point though.


This is cool, would love to see an npm package for this!

VirtualDJ [0] lets you do this with its Stems feature in real time and it works really well, and I believe it's free.

[0] https://www.virtualdj.com/


Very cool, thank you for sharing, and making it open source! I love this!


Does this work differently than moises.ai?

Been really enjoying that one


+1 I had the exact same question... would love to know if there are any open source models that match the accuracy of Moises.ai


I tried 2 different files, a .flac and an .mp3, both around three and a half minutes long, and the demixing stopped at 7.5% on both files.


I tried on Firefox now and it also stopped at 7.5%; my previous attempts were on Edge Chromium.


What does the developer console show? After the first few layers (STFT -> FC1 -> BN1) comes the LSTM, which is a much slower step (and could make it look like the site has stopped).
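For reference, the layer order is roughly like the sketch below (dimensions are illustrative, not the actual UMX-L hyperparameters); the LSTM has to step through every spectrogram frame sequentially, which is why it dominates the runtime:

```
import torch
import torch.nn as nn

# illustrative dimensions, not the real UMX-L hyperparameters
n_frames, n_batch, n_bins, hidden = 500, 1, 2049, 512

fc1 = nn.Linear(n_bins, hidden, bias=False)
bn1 = nn.BatchNorm1d(hidden)
lstm = nn.LSTM(hidden, hidden // 2, num_layers=3, bidirectional=True)

spec = torch.rand(n_frames, n_batch, n_bins)  # magnitude spectrogram from the STFT

x = fc1(spec.reshape(-1, n_bins))                          # FC1: one matmul per frame, fast
x = torch.tanh(bn1(x)).reshape(n_frames, n_batch, hidden)  # BN1
x, _ = lstm(x)                                             # LSTM: sequential over all frames, the slow part
```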


Great work. Thanks for sharing the code!


I couldn't get this to work on Firefox 115, but could in Chrome. Does this require Chrome?


I have been testing and using it on Firefox 113. I wonder if it's related to the maximum memory of WASM. I compile it with `-s MAXIMUM_MEMORY=4GB`, which is supported by Chrome but maybe not Firefox.

Does the developer console say something like "Aborted" or give a memory error in Firefox? If your clip is big enough that it uses >2GB but <4GB of memory, that could explain why it works in Chrome.


Nevermind, got it to work after disabling adblocker


Wow, I always thought Spleeter was the only game in town. I've not been that satisfied with its results.

Is there a command line version of this tool?


In the free-music-demixer repo, there is a `file_demixer` utility [1] which you can compile using the CMakeLists file - it will use the same quantized weights and produce the same "lower quality" output as the website. Clone the git repo with submodules because it compiles the vendored libnyquist library to load audio files.

Of course, you can always run the upstream umxl model easily:

```
pip install openunmix  # pulls in pytorch
```

Installing openunmix installs the `umx` cli:

```
$ umx --help
usage: umx [-h] [--model MODEL] [--targets TARGETS [TARGETS ...]]
           [--outdir OUTDIR] [--ext EXT] [--start START] [--duration DURATION]
           [--no-cuda] [--audio-backend AUDIO_BACKEND] [--niter NITER]
           [--wiener-win-len WIENER_WIN_LEN] [--residual RESIDUAL]
           [--aggregate AGGREGATE] [--filterbank FILTERBANK] [--verbose]
           input [input ...]

UMX Inference

positional arguments:
  input                 List of paths to wav/flac files.
  ...
```
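For example, to separate a local file into the four stems (song.wav is just a placeholder filename):

```
umx song.wav --outdir ./separated
```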

It will download the weights automatically for you and demix at a higher quality than my site, for two reasons:

* Unquantized weights (small impact)

* Post-processing step (bigger impact)

I tried to tackle the post-processing step in my C++ code (which would win ~1 dB in quality across all targets) but it's too tricky for now [2]. Maybe some other day.

1: https://github.com/sevagh/free-music-demixer/blob/main/examp...

2: https://github.com/sigsep/open-unmix-pytorch/blob/master/ope...


Thanks for the answer and detail! My use case is removing just the drums so that I can make my own drum cover recordings. Since all the models separate out all the instruments, I have to take the guitar/bass/vocals/other tracks and mix them back together, which produces a very anemic-sounding backing track. If I spent the time to train something on just removing drums, I suppose it would sound better.
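For reference, the mix-them-back-together step is just a sum of the stems. A minimal sketch with soundfile (the stem filenames are placeholders, and the stems are assumed to share length and samplerate):

```
import numpy as np
import soundfile as sf

# placeholders for the demixed non-drum stems produced by umx/demucs
paths = ["vocals.wav", "bass.wav", "other.wav"]
stems = [sf.read(p)[0] for p in paths]  # each is (samples, channels)
sr = sf.read(paths[0])[1]

backing = np.sum(stems, axis=0)                      # drumless backing track
backing /= max(1.0, float(np.max(np.abs(backing))))  # avoid clipping after the sum
sf.write("backing_no_drums.wav", backing, sr)
```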


What about Melodyne?



