We've been cooking up a new experiment where you can record yourself singing or talking and the app will generate vocals to match your words and timings. It's backed by an end-to-end latent diffusion model that generates audio conditioned on both the style and the lyric timings, and it's quite fast. Your actual voice and melody are never used, only the transcription and timing, and we don't store the recording.
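For the curious, here's a rough PyTorch sketch of what "conditioned on both the style and the lyric timings" could look like in a diffusion denoiser. To be clear, this is not our actual architecture: the module names, dimensions, and the FiLM-plus-cross-attention conditioning choices are all illustrative assumptions.

```python
# Hypothetical sketch: a denoiser conditioned on a style embedding (via FiLM)
# and on lyric-timing tokens (via cross-attention). All names and sizes are
# invented for illustration.
import torch
import torch.nn as nn

class LyricEncoder(nn.Module):
    """Embeds (phoneme id, start time, end time) triples into tokens."""
    def __init__(self, n_phonemes=64, d=128):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d)
        self.time_proj = nn.Linear(2, d)  # start/end times in seconds

    def forward(self, phones, times):
        # phones: (B, T) long, times: (B, T, 2) float
        return self.phone_emb(phones) + self.time_proj(times)

class Denoiser(nn.Module):
    """Predicts the noise in an audio latent, conditioned on style + timing."""
    def __init__(self, d_latent=64, d=128, d_style=32):
        super().__init__()
        self.in_proj = nn.Linear(d_latent, d)
        self.t_proj = nn.Linear(1, d)                # diffusion timestep
        self.film = nn.Linear(d_style, 2 * d)        # style -> scale/shift
        self.cross = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Linear(d, d_latent)

    def forward(self, z_t, t, style, lyric_tokens):
        h = self.in_proj(z_t) + self.t_proj(t[:, None, None].float())
        scale, shift = self.film(style).chunk(2, dim=-1)
        h = h * (1 + scale[:, None]) + shift[:, None]     # style conditioning
        attn, _ = self.cross(h, lyric_tokens, lyric_tokens)  # timing conditioning
        return self.out(h + attn)

# Toy forward pass for one training step's noise prediction.
B, frames = 2, 100
enc, net = LyricEncoder(), Denoiser()
z_t = torch.randn(B, frames, 64)          # noisy audio latents
t = torch.randint(0, 1000, (B,))          # diffusion timesteps
style = torch.randn(B, 32)                # style embedding
phones = torch.randint(0, 64, (B, 20))    # phoneme ids from the transcription
times = torch.rand(B, 20, 2)              # per-phoneme start/end times
eps_hat = net(z_t, t, style, enc(phones, times))
print(eps_hat.shape)  # torch.Size([2, 100, 64])
```

The key point the sketch tries to capture is that only the transcription and its timings flow into the model; the raw recording itself is never an input.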
We've found it's a really natural way to shape the output you want and dream up a song concept. Curious to hear what you think!
This, plus the Music ControlNet post from yesterday, gives me some hope that audio AI will head in the direction of creative tools rather than dystopian full-song generation.