Once this technology gets incorporated into DJ mixers / CDJs, this is going to make DJing much more creatively interesting.
Historically, blending between mixed stereo tracks has been limited to mixing EQ bands, but now DJs will be able to layer and mix the underlying stems themselves -- like putting the vocal from one track onto an instrumental section of another (even if a cappella / instrumental versions were never released).
It also opens up a previously unreachable world for amateur remixing in general; for instance, creating surround sound mixes from stereo or even mono recordings for playback in 3D audio environments like Envelop (https://envelop.us) [disclaimer: I am one of the co-founders of Envelop]
Hey, don't knock him. I'd never thought about it before. Being taught or corrected is great as long as people aren't dicks about it. Even then it has some value :)
I shall use this knowledge and endeavour to share it where possible.
Wouldn't it be much more efficient for everyone (and even lucrative for the owners) to also provide the studio stems at a slightly higher/different price?
(not that some of these aren't already available if you know where to search, but it's not very... structured)
The "open source" format/practice exists already: just bounce your mix into separate audio files, one for each track or group, into a folder, zip it, ship.
Only, it's not (yet) much embraced on the commercial side. When you pay up to 20 boxes to get a full album, why couldn't you pay, say, 100 to get the same album but with separated tracks for your own use, plus instructions as to who to contact & how for any other kind of use?
This is a thing specifically in the contemporary Christian music industry, so that churches can pick-and-choose parts from the original song to use as backing tracks for live performance. See e.g. https://www.multitracks.com/songs/Hillsong-Young-And-Free/Th...
For anyone who wants to try Spleeter in a version that "just works" without having to install TensorFlow and mess with offline processing, Spleeter has been built into a wave editor called Acoustica from Acon Digital. It's been working really well for me, and the whole package is solid competition to editors like iZotope RX:
I've been trying for months to make redistributable Spleeter "binaries" that I can bundle with user-facing applications. Happy to see someone's succeeded where I've failed. Really sad they've chosen not to share their changes :(
I emailed them requesting more info on how their implementation works. I think this might be a violation of the MIT license?
It's something they're working on. You can already change most keyboard shortcuts, but there are a few corner cases that people have been asking for (shortcuts with arrow keys are a problem at the moment). The developers have been extremely responsive to feature requests on the Gearslutz forum though; I've seen some feature requests implemented in just a few days:
I'm going to pretend that we didn't see this (otherwise extremely helpful) link to a major discussion from 6 months ago, so as not to have to mark the current post a dupe.
I often have voice recordings with a lot of background noise (e.g. a public lecture in a room with poor acoustics, recorded from a phone in the audience — there are usually sounds of paper rustling, noises from the street, etc). Is this "source separation" the sort of thing that could help, or does anyone have other tips? The best thing I have so far is based on this https://wiki.audacityteam.org/wiki/Sanitizing_speech_recordi... —
(1) Open the file in Audacity and switch to Spectrogram view,
(2) set a high-pass filter at ~150 Hz, i.e. filter out frequencies lower than that (which tend to be loud anyway; see the sketch after this list),
(3) don’t remove the higher frequencies (which aren’t loud), because they are what make the consonants understandable (apparently),
(4) look for specific noises, select the rectangle, and use “Spectral Edit Multi Tool”.
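If you want to script step (2) outside Audacity, below is a minimal sketch in Python using numpy and scipy (both assumed to be installed; "lecture.wav" and the exact filter settings are just placeholders for illustration):

    # Minimal sketch: apply a ~150 Hz high-pass filter to a WAV recording.
    # Assumes 16-bit PCM input; "lecture.wav" is a placeholder file name.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    rate, audio = wavfile.read("lecture.wav")
    audio = audio.astype(np.float32) / 32768.0       # normalize to [-1, 1]

    # 4th-order Butterworth high-pass at ~150 Hz, applied forward-backward
    sos = butter(4, 150, btype="highpass", fs=rate, output="sos")
    filtered = sosfiltfilt(sos, audio, axis=0)

    wavfile.write("lecture_hp.wav", rate, (filtered * 32767).astype(np.int16))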
But if machine learning can help that would be really interesting! This Spleeter page does mention “active listening, educational purposes, […] transcription” so I'm excited.
I'd generally try iZotope RX for cleaning up audio - Dialogue Isolate is probably the exact feature you would want (and I gather is often used in movies to clean up on location dialogue), but it's only in the most expensive Advanced version:
Cheaper versions of RX still have various noise reduction tools, de-verb for reducing reverb and room echo, and a range of spectral editing tools as well.
You could give the Nvidia RTX Voice plugin a shot if you have one of the compatible cards. I'm not sure how it deals with low-level background noise; the YouTube reviews mostly tested it with over-the-top cases like a vacuum cleaner next to the speaker.
https://krisp.ai uses machine learning to remove background noise. I've used them with Zoom calls and it works really well. I think they don't currently have an "upload audio" feature for existing recordings, but it would be awesome if they offered this in the future.
Sorry it's not something you can use now, but I just thought I would mention it! I also did a quick Google search but unfortunately I couldn't find any AI noise removal tools that might solve this problem.
Just tried this and it's really impressive, I'd say it does a nicer job on vocals than Spleeter. Less of the "underwater" effect compared to what I remember of Spleeter.
A local radio station has a four-hour broadcast. They are required by the station to play a certain number of music tracks (about 6 per hour), but there has been demand to make the broadcast available as a podcast without the music.
Could this make it possible to automatically remove the music from the MP3 file they have available? With 6 tracks per hour times 4 hours, manually removing the music is time-consuming.
I doubt it, as it seems all vocals are output to a single file...
Presumably they own the rights on broadcast material, so they'd have to be directly involved in the podcast production. That given, it would probably be more straightforward to take the microphone feeds from their broadcast desk (via "aux-out" perhaps) and record only the spoken output separately.
Sox etc. could be used for silence detection, probably best done in post (scriptable), but could be piped through after experimenting with settings. Otherwise, even old desks can trigger when a mic channel fader is raised, so that too is a possibility for pausing the recording during music.
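As a rough sketch of that scriptable post-processing idea (here in Python with numpy and soundfile instead of Sox; the gate threshold and window length are guesses that would need tuning against real material):

    # Rough sketch: keep only the parts of a mic-only recording where speech
    # is present, using a simple RMS energy gate. The -40 dBFS threshold and
    # 0.5 s window are placeholder values, not tuned against real audio.
    import numpy as np
    import soundfile as sf

    audio, rate = sf.read("mic_feed.wav")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)              # fold to mono

    win = int(0.5 * rate)                       # 0.5 s analysis windows
    kept = []
    for start in range(0, len(audio), win):
        chunk = audio[start:start + win]
        rms_db = 20 * np.log10(np.sqrt(np.mean(chunk ** 2)) + 1e-12)
        if rms_db > -40:                        # above the gate: keep it
            kept.append(chunk)

    sf.write("speech_only.wav", np.concatenate(kept), rate)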
1. Play back the recording at 4x (1 hour of playback for 4 hours of real-time broadcast). Mark the edges where music stops and starts. You have to do that 12 times for 6 songs. You'll have to slow down near the changes in order to catch the precise time of an edge. Delete the music between the two edges. Repeat 5 more times.
There may be audacity plugins that do what you want or do something closer to it.
2. Use some combination of low-pass and high-pass filters to remove the music. It's not going to be perfect and you'll still need to edit out the filtered music anyway.
We can deepfake vocals and redraw your photos as if they were painted by van Gogh... I'm sure someone has trained something that immortalizes different artists into their AI instrumental avatar.
If not, I'm sure if you ask nicely Amazon will give you a few credits to burn on a pandemic art project.
I couldn't find any examples, so I was wondering, for anyone who's tried this: are the results better than using a bandpass filter and an equalizer to isolate frequencies, or one of those auto-karaoke things?
Because the ability to separate any song into its individual tracks would be amazing. The ability to remix any song or just play with any instrument or vocal track would be awesome. But does it have the same poor quality and limitations as most frequency-based source separation?
Yeah, the results are a lot better than filtering... deep learning has pushed the state of the art in source separation quite a lot recently.
It isn’t magical and the results still have artefacts (mostly that kind of slightly underwater sound of a low bitrate MP3, I believe due to the way the audio is reconstructed from FFTs), and some songs trip it up entirely, but it’s definitely worth playing around with and I think it could potentially have applications for DJ/remix use if you added enough effects etc.
It’s fairly easy to install and runs quickly without a GPU, or you can try their Colab notebook, or it seems someone has hosted a version at https://ezstems.com/
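If you go the Python route, the basic usage is only a couple of lines (a minimal sketch following the Spleeter README; "song.mp3" and the output directory are placeholders):

    # Split a track into vocals + accompaniment with the pretrained 2-stem
    # model; the model weights are downloaded automatically on first run.
    from spleeter.separator import Separator

    separator = Separator('spleeter:2stems')
    separator.separate_to_file('song.mp3', 'output/')

Swapping in 'spleeter:4stems' or 'spleeter:5stems' also splits out drums, bass and (for 5 stems) piano.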
Had a play with the Colab and it's quite good indeed. The authors claim "100x real time speed", which is mighty impressive, but I'd be more interested in seeing a "Try Really Hard" mode, trading off quality and speed. Is that a thing that can be done in the current code, I wonder?
If you're trying to run it on Windows with Python 3.8, add numpy and cython to the dependencies, and change Tensorflow's requirement to be >= rather than ==.
Though then you'll run into compatibility errors like "No module named 'tensorflow.contrib'" which you'll have to fix.
While this is awesome, it's trained on MUSDB18-HQ, which as far as I can tell is proprietary. zenodo.org claims it is available, but I have filled out their "request access" page a half-dozen times. Does anyone know of a training dataset that's possible to obtain?
Here is the zenodo response:
Your access request has been rejected by the record owner.
The decision to reject the request is solely under the responsibility of the record owner. Hence, please note that Zenodo staff are not involved in this decision.
Out of interest, and to put this in context - your brain can only do this for conversation, not music.
You routinely suppress background noise and room acoustics when listening to someone speaking. But you don't do the same thing when listening to music. At best you can focus on individual elements in a track, and you can parse them musically (and maybe lyrically).
But you don't suppress the rest to the point where you don't hear it.
To somebody with APD, it sounds like science fiction, although it does require more suspension of disbelief than faster than light travel or teleportation.
Once you have obtained just the guitar from a track, are there any tools out there which can work out the tablature (e.g. https://www.ultimate-guitar.com//top/tabs) so you can play along?
Well, it seems neural networks have started to appear for vocal and instrumental track isolation ^^ Recently I discovered https://www.lalal.ai and it works quite well.
I tried using the 2 stem model to remove the music from an audio recording of two people talking. It kept sucking in some of the music whenever someone started talking, however. Is there a better model to use for that?
Sometimes the distinction is made between "real-time" and "online" processing.
The first one refers to the speed of the processing in relation to the length of the recording - so if, say, you can process a 10-minute recording in 1 minute, then you're 10x real-time. However, your analysis might require the full track to be available for best outcomes, and so you cannot really start the processing until the full source is available.
The latter is what "online" processing refers to: the ability to process on-the-fly, in parallel to the recording. Obviously, this cannot be faster than real-time ;-) but hopefully it is not slower, either. Oftentimes, though, you get a (somewhat constant and) hopefully small offset, i.e., you can process a 10-minute recording online in the same time but you need another 10 seconds on top of that.
This is, by the way, not restricted to source separation, it applies to other disciplines as well, say, automatic speech recognition.
I experimented with the spleeter architecture quite a bit and I would say this is not suitable for real time audio processing. The reason is that the model needs at least 512 frames of audio samples to produce an output usable for source separation. This adds a ton of latency. I tried with smaller windows but the results are very bad.
It's way faster than real time; I'm not sure why slowing it down would be an advantage. You still need to take the resultant data and do things with it, as a DJ, and faster is better.
This is ultra-cool .. I have a few terabytes of jam-session recordings that I'm going to throw at this. If it ends up being usable to the point that I can re-do vocals over some of the greatest moments in the archive, I'll be praising whatever Spleeter deity makes itself visible to me at the time, most highly ..