Show HN: Free AI-based music demixing in the browser (sevag.xyz)
190 points by sevagh on July 13, 2023 | 45 comments
Hi all,

I've spent some time working on music demixing or music source separation algorithms, which take in a mixed song and output estimates of isolated components (e.g. vocals, drums, bass, other).

I took a popular PyTorch model with good performance (Open-Unmix, UMX-L weights), reimplemented the inference steps in C++, and compiled it to WebAssembly for a free client-side music demixer.




I tried to use it, but I ran into some issues like others in this thread.

I have tried many sources and methods over the years and settled on Spleeter [0]. It works well even for 10+ minute songs, across styles varying from flamenco to heavy metal.

[0] https://github.com/deezer/spleeter


You can always run higher-quality models than Spleeter, for example Demucs or the real Open-Unmix UMX-L, in their full PyTorch glory, as long as you have a computer that can handle it.

I'm somewhat limited in what can run in a browser by the nature of the project.


Thanks for the suggestion. Good luck with your project.


wait, deezer? I wonder why they made something like this


Cool.

What about the ability to separate out background noise? I'm thinking of a project like the one where the Beatles live album had the screaming fans mixed down:

https://www.wired.com/2017/03/remastering-one-beatles-live-a...

"It doesn't exist as a software program that is easy to use," Clarke says. "It's a lot of Matlab, more like a research tool. There's no graphical front end where you can just load a piece of audio up, paint a track, and extract the audio. I write manual scripts, which I then put into the engine to process."

There are tons of recordings of live performances that could use a little AI TLC.


I've thought about this problem before (but I have so many projects going on that I axed it).

Hypothetically, this is how I would approach it.

I would start by forking Open-Unmix or another 4-stem model (Demucs, MDXNet). The code is oriented towards the 4 sources (vocals/drums/bass/other), a split that dominates because the major training datasets use these stems.

The Open-Unmix training code goes like:

```
x, y1, y2, y3, y4 = load_training_data()

y1_est, y2_est, y3_est, y4_est = unmix(x)

loss = loss([y1, y1_est, y2, y2_est, y3, y3_est, y4, y4_est])
```

In your case it may be simpler, something like `noisy mix = clean mix + background noise`.

That way, I don't think you're even constrained to using a dataset that has 4 stems available (which is a rare quality in a dataset).

Instead, you need a way to acquire or generate screaming and other concert noises.

Then, the new training code could look like:

```
x = load_training_data()

n_samples = x.shape[-1]  # get length of music waveform

noise = generate_screams(n_samples)

x_noisy = x + noise

x_est = unmix(x_noisy)

loss = loss(x, x_est)
```

Well, anyway, that's my naive first idea of how I would approach it. But this relies on having clean background noise/screams _without any music in them_, and I'm not aware of such datasets.
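To make that concrete, here is a minimal runnable PyTorch sketch of the same training step, with a toy model and random tensors standing in for the real network, dataset, and scream generator:

```
import torch
import torch.nn as nn

# toy stand-in for a single-target separation model (a real one would be a
# fork of Open-Unmix/Demucs operating on spectrograms or waveforms)
model = nn.Sequential(
    nn.Conv1d(2, 16, 5, padding=2), nn.ReLU(), nn.Conv1d(16, 2, 5, padding=2)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for step in range(10):
    x = torch.randn(4, 2, 44100)       # pretend batch of clean stereo music, 1s at 44.1 kHz
    noise = 0.1 * torch.randn_like(x)  # stand-in for generate_screams(): crowd noise, no music
    x_noisy = x + noise                # synthesize the "live bootleg" mix
    x_est = model(x_noisy)             # estimate the clean music from the noisy mix
    loss = criterion(x_est, x)         # loss(x, x_est) from the pseudocode above

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```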


Would love to see something like this auto-mix song transitions in playlists to be more DJ-like.


I started down this road, and have amassed a small collection of files made up of just the first ten and last ten seconds of demixed tracks of every song in my library

I hoped to put together a system that compares the backs to the fronts and lists the output to find cool transitions, but I have no idea how to actually "grade" the similarities

Beyond basic BPM matching on the drum tracks, nothing I've tried has made for anything really compelling (sounds random... :( )


OK, so, tangentially related: I tried to do something once - I took small chunks of songs generated by SampleRNN in an attempt to stitch together the ones that sounded the most similar to create a much longer track.

The script [1] uses Essentia Chromaprint [2] to "grade" the similarity of audio tracks, and combine the ones with the closest chromaprint. No crossfade or BPM matching, just yolo concatenation.

I have a track on Soundcloud which uses the above technique (mashing together short generated clips by their chromagram), trained on Cannibal Corpse [3]

1: https://github.com/sevagh/1000sharks.xyz/blob/master/sampler...

2: https://essentia.upf.edu/reference/std_Chromaprinter.html

3: https://soundcloud.com/user-167126026/1000sharks-domainal-sk...
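If it helps, the rough shape of the idea is something like this sketch, which uses librosa chroma features and cosine similarity as a simplified stand-in for the Essentia Chromaprinter comparison in the actual script:

```
import librosa
import numpy as np

def chroma_signature(path):
    # mean chromagram over the clip; a rough stand-in for a chromaprint fingerprint
    y, sr = librosa.load(path, sr=22050, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # shape (12, n_frames)
    return chroma.mean(axis=1)

def similarity(path_a, path_b):
    # cosine similarity between the two mean chroma vectors
    a, b = chroma_signature(path_a), chroma_signature(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```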


Interesting, I attempted to do the same as you but stopped just shy of BPM matching.

However, I did get sound similarity working using an audio tagging neural net [1]. I chopped out the first and last 15 seconds of every song in my collection and ran them all through this analysis, which produces a ~520-dimensional vector. I then targeted specific endings I wanted to match and used Euclidean distance to find the closest matching song beginning.

YMMV, but I thought it actually worked pretty well; I just never got around to automating the BPM matching. I can try to look for my old script if you're interested :)

[1] https://github.com/fschmid56/EfficientAT
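The distance part is roughly like the sketch below, where embed(), ending_clips, and beginning_clips are placeholders for a wrapper around the tagging net and the two sets of 15-second clips:

```
import numpy as np

# embed() is a placeholder for a wrapper around the audio tagging net in [1],
# returning a ~520-dimensional vector for a 15-second clip
endings = {p: embed(p) for p in ending_clips}        # last 15 seconds of each song
beginnings = {p: embed(p) for p in beginning_clips}  # first 15 seconds of each song

def best_transitions(ending_path, k=5):
    # rank candidate song openings by Euclidean distance to this ending
    e = endings[ending_path]
    dists = {p: np.linalg.norm(e - b) for p, b in beginnings.items()}
    return sorted(dists, key=dists.get)[:k]
```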


My idea was to let an existing app like djay do the beat matching (which it does really well!) and demixing (less well), and then make a custom app that would act like a MIDI controller, adjusting the vocals, beats, and melody while songs are transitioning, perhaps working off of cues bookmarked in the songs.


A few years from now, DJs will have buttons on their pads to extract voices, beats, melodies, etc. in realtime. It could result in an interesting new style of club music.


DJs have this now; the track still needs to be processed, but that can happen in advance or on the fly (VirtualDJ + a GPU takes about 10s to process a track and can do so seamlessly while the track is playing).


This already exists in the djay app (paid add-on feature). It even runs on a smartphone! It performs about the same as the app this thread is about.


> re: "realtime"

it separates the components of tracks that it can download (and process), not of a live audio feed


The AI could really just:

- select the best next song given a simple input (for example, a microphone or a camera looking into the crowd)

- mix it into the current song

- repeat


Well, the naive way to do it would be to detect the onset of each beat by assuming it corresponds to kick drums on the fours, then speed up or slow down the track you're mixing in so the fours line up. This would be very easy to implement, and over a longish crossfade (say a couple of bars) it would be fairly convincing.
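A rough librosa sketch of that naive approach (tempo estimation, time-stretching the incoming track, and a linear crossfade over ~2 bars; it matches tempo but does not align the beat phase, which you would still need to do):

```
import librosa
import numpy as np

def naive_transition(path_current, path_next):
    y_cur, sr = librosa.load(path_current, mono=True)
    y_next, _ = librosa.load(path_next, sr=sr, mono=True)

    tempo_cur, _ = librosa.beat.beat_track(y=y_cur, sr=sr)
    tempo_next, _ = librosa.beat.beat_track(y=y_next, sr=sr)

    # stretch the incoming track so its tempo matches the outgoing one
    y_next = librosa.effects.time_stretch(y_next, rate=float(tempo_cur) / float(tempo_next))

    # linear crossfade over ~2 bars (8 beats) at the matched tempo
    fade_len = int(8 * (60.0 / float(tempo_cur)) * sr)
    fade = np.linspace(0.0, 1.0, fade_len)
    overlap = y_cur[-fade_len:] * (1.0 - fade) + y_next[:fade_len] * fade
    return np.concatenate([y_cur[:-fade_len], overlap, y_next[fade_len:]]), sr
```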

Using AI you could probably intuit the "pulse" of the music to find the downbeat of each bar, and sync tracks up so that they mix in a place that makes musical sense.


Didn't work on Chrome or Brave or Firefox for me.

Console warning:

The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page. https://goo.gl/7K7WLu

Download Weights button also did nothing.


I see that error, but in my own testing it never affected the outputs. However, Download Weights should be working. It's not exactly an interactive progress bar; it just waits until all 45 MB are downloaded before setting the progress bar to 100%.


Retry. It'll work. (Brave tested)


I like your site. WebAssembly has opened up client-side ML inference and it's a really convenient way for devs to serve certain models now. Out of curiosity, why did you reimplement inference in C++ to compile to WASM instead of using a Python-to-WASM solution?


Good question! So, I wasn't even thinking about WASM to begin with. When I saw llama.cpp and whisper.cpp on the front page of HN, I found the idea exciting: instead of neural networks being magic, I wanted to copy the ggml idea of parsing the PyTorch weights file myself and rewriting the inference code in a lower-level language than Python (or, more accurately, than PyTorch, since so much of the matrix heavy lifting, e.g. broadcasting and reshaping, is done for you automatically).

That's when I wrote umx.cpp [1] (which is what this site is based on).

On an unrelated project, a friend of mine mentioned WASM, and as I looked into it a bit more I thought trying to compile umx.cpp to WASM would be a great idea, since I only use Eigen (which is a header-only library that only depends on std).

1: https://github.com/sevagh/umx.cpp


I met a DJ a few weeks ago who had vocals, drums, and bass all on knobs and could do this on the fly in realtime. I bet it works in a similar way under the hood.

I believe the program he was using was djay Pro AI [0].

[0] https://www.algoriddim.com/djay-pro-mac


The feature is called "stems" and it's also available in most major DJ apps like Serato and Rekordbox.


Anyone know if this kind of real-time filtering is what is done with Apple Music’s karaoke feature?


Nice job. I love this trend of fully client-side WASM utilities.


Haven’t tried this demo, but in my experience these open-source models that split music into four components work absolutely fantastically. Not quite perfectly (if you remove vocals, the remaining track may have faint echoes of them), but astoundingly well compared to the state of the art just 5 years ago or so.

However… what if you want more than four components? What if you want to split a complex arrangement into a separate component for each individual instrument? Does anyone know of any interesting research in this area?


Demucs [1], one of the leading/SOTA systems, has an experimental 6-source model, `htdemucs_6s`, which adds piano and guitar:

>We are also releasing an experimental 6 sources model, that adds a guitar and piano source. Quick testing seems to show okay quality for guitar, but a lot of bleeding and artifacts for the piano source.

I believe Audioshake [2] (a company in the space) is doing guitar separation as well.

1: https://github.com/facebookresearch/demucs

2: https://www.audioshake.ai/


The primary challenge is the availability of sufficiently large datasets with many separate stems/tracks. Those who find a way to build them will be able to churn out such a model using existing architectures.

The companies doing AI mixing have a huge advantage in this area.


Gave it a shot - as the webpage warns, it does take a bit of time. The results are indeed impressive, although the separation isn't perfect: each division (drums/bass/vocals) has slight echoes of the others (the bass line occasionally includes a slightly muffled vocal, the vocals include some snare, hi-hat, and strings). It's a great starting point though.


This is cool, would love to see an npm package for this!

VirtualDJ [0] lets you do this with its Stems feature in real time and it works really well, and I believe it's free.

[0] https://www.virtualdj.com/


Very cool, thank you for sharing, and making it open source! I love this!


Does this work differently than moises.ai?

Been really enjoying that one


+1 I had the exact same question... would love to know if there are any open source models that match the accuracy of Moises.ai


I tried 2 different files, a .flac and an .mp3, both around three and a half minutes long, and the demixing stopped at 7.5% on both files.


I tried on Firefox now and it also stopped at 7.5%; my previous attempts were on Edge Chromium.


What does the developer console show? After the first few layers (STFT -> FC1 -> BN1) comes the LSTM, which is a much slower step (and could make it look like the site has stopped).
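For reference, the layer order is roughly like the sketch below (dimensions are illustrative, not the actual UMX-L hyperparameters); the LSTM has to step through every spectrogram frame sequentially, which is why it dominates the runtime:

```
import torch
import torch.nn as nn

# illustrative dimensions, not the real UMX-L hyperparameters
n_frames, n_batch, n_bins, hidden = 500, 1, 2049, 512

fc1 = nn.Linear(n_bins, hidden, bias=False)
bn1 = nn.BatchNorm1d(hidden)
lstm = nn.LSTM(hidden, hidden // 2, num_layers=3, bidirectional=True)

spec = torch.rand(n_frames, n_batch, n_bins)  # magnitude spectrogram from the STFT

x = fc1(spec.reshape(-1, n_bins))                          # FC1: one matmul per frame, fast
x = torch.tanh(bn1(x)).reshape(n_frames, n_batch, hidden)  # BN1
x, _ = lstm(x)                                             # LSTM: sequential over all frames, the slow part
```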


Great work. Thanks for sharing the code!


I couldn't get this to work on Firefox 115, but could in Chrome. Does this require Chrome?


I have been testing and using it on Firefox 113. I wonder if it's related to the maximum memory of WASM. I compile it with `-s MAXIMUM_MEMORY=4GB`, which is supported by Chrome but maybe not Firefox.

Does the developer console say something like "Aborted" or give a memory error in Firefox? If your clip is big enough that it uses >2GB but <4GB of memory, that could explain why it works in Chrome.


Nevermind, got it to work after disabling adblocker


Wow, I always thought Spleeter was the only game in town. I've not been that satisfied with its results.

Is there a command line version of this tool?


In the free-music-demixer repo, there is a `file_demixer` utility [1] which you can compile using the CMakeLists file - it will use the same quantized weights and produce the same "lower quality" output as the website. Clone the git repo with submodules because it compiles the vendored libnyquist library to load audio files.

Of course, you can always run the upstream umxl model easily:

```
pip install openunmix  # pulls in pytorch
```

Installing openunmix installs the `umx` cli:

```
$ umx --help
usage: umx [-h] [--model MODEL] [--targets TARGETS [TARGETS ...]]
           [--outdir OUTDIR] [--ext EXT] [--start START] [--duration DURATION]
           [--no-cuda] [--audio-backend AUDIO_BACKEND] [--niter NITER]
           [--wiener-win-len WIENER_WIN_LEN] [--residual RESIDUAL]
           [--aggregate AGGREGATE] [--filterbank FILTERBANK] [--verbose]
           input [input ...]

UMX Inference

positional arguments:
  input                 List of paths to wav/flac files.
  ...
```
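For example, to separate a local file into the four stems (song.wav is just a placeholder filename):

```
umx song.wav --outdir ./separated
```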

It will download the weights automatically for you and demix at a higher quality than my site, for two reasons:

* Unquantized weights (small impact)

* Post-processing step (bigger impact)

I tried to tackle the post-processing step in my C++ code (which would win ~1 dB in quality across all targets) but it's too tricky for now [2]. Maybe some other day.

1: https://github.com/sevagh/free-music-demixer/blob/main/examp...

2: https://github.com/sigsep/open-unmix-pytorch/blob/master/ope...


Thanks for the answer and detail! My use case is removing just the drums so that I can make my own drum cover recordings. Since all the models separate out all the instruments, I have to take the guitar/bass/vocals/other tracks and mix them back together, which produces a very anemic-sounding backing track. If I spent the time to train something on just removing drums, I suppose it would sound better.
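For reference, the mix-them-back-together step is just a sum of the stems. A minimal sketch with soundfile (the stem filenames are placeholders, and the stems are assumed to share length and samplerate):

```
import numpy as np
import soundfile as sf

# placeholders for the demixed non-drum stems produced by umx/demucs
paths = ["vocals.wav", "bass.wav", "other.wav"]
stems = [sf.read(p)[0] for p in paths]  # each is (samples, channels)
sr = sf.read(paths[0])[1]

backing = np.sum(stems, axis=0)                      # drumless backing track
backing /= max(1.0, float(np.max(np.abs(backing))))  # avoid clipping after the sum
sf.write("backing_no_drums.wav", backing, sr)
```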


What about Melodyne?



