As with Stable Diffusion, text prompting will be the least controllable way to get useful output with this model. I can easily imagine midi being used as an input with control net to essentially get a neural synthesizer.
Yes. Since working on my AI melodies project (https://www.melodies.ai/) two years ago, I've been saying that producing a high-quality, finalized song from text won't be feasible or even desirable for a while, and it's better to focus on using AI in various aspects of music making that support the artist's process.
Text will be an important input channel for texture, sound type, voice type and so on. You can't just use input audio, that defeats the point of generating something new. You can't also only use MIDI, it still needs to know what sits behind those notes, what performance, what instrument. So we need multiple channels.
Emad hinted here on HN the last time this was discussed that they were experimenting with exactly that. It will come, by them or by someone else quickly.
Text-prompting is just a very coarse tool to quickly get some base to stand on, ControlNet is where the human creativity again enters.
I think it would be ideal if it could take the audio recording of humming or singing a melody together with a text prompt and spitting out a track that resembles it
It's crazy that nobody cares. It seems to me that ML hype trends focus on denying skills and disproving creativity by denoising randoms into what are indistinguishable from human generation, and to me this whole chain of negatives don't seem to have proven its worth.
LLMs allow people without certain skills to be creative in forms of art that are inaccessible to them.
With Dalee - I can get an image of something I have in my head, without investing into watching hundreds of hours of Bob Ross(which I do anyway)
With audio generators - I can produce music that is in my head, without learning how to play an instrument or paying someone to do it. I have to arrange it correctly, but I can put out a techno track without spending years in learning the intricacies.