Hacker News

As with Stable Diffusion, text prompting will be the least controllable way to get useful output from this model. I can easily imagine MIDI being used as an input with ControlNet to essentially get a neural synthesizer.



Yes. Since working on my AI melodies project (https://www.melodies.ai/) two years ago, I've been saying that producing a high-quality, finalized song from text won't be feasible or even desirable for a while, and it's better to focus on using AI in various aspects of music making that support the artist's process.


Text will be an important input channel for texture, sound type, voice type, and so on. You can't use only input audio; that defeats the point of generating something new. You also can't use only MIDI; the model still needs to know what sits behind those notes: what performance, what instrument. So we need multiple channels.


Emad hinted here on HN the last time this was discussed that they were experimenting with exactly that. It will come quickly, either from them or from someone else.

Text prompting is just a very coarse tool to quickly get some base to stand on; ControlNet is where human creativity enters again.


Yeah, we built ComfyUI, so you can imagine what is coming soon around that.

Need to add more stuff to my Soundcloud https://on.soundcloud.com/XrqNb


For music, perhaps. For sound effects, I think text prompting is a rather good UI.


A ControlNet/img2img-style workflow, where you mimic a sound with your mouth and it then makes it realistic, could also be usable.


I think it would be ideal if it could take an audio recording of humming or singing a melody, together with a text prompt, and spit out a track that resembles it.


1. Do your humming and pass it to something like Stable Audio with ControlNet

2. Convert/average the tone for each beat to generate something resembling a music sheet

3. Use a vocaloid with LLM-generated lyrics based on your prompt (or just put in your own lyrics) and pass in the music file

4. Combine steps 1-3

Would love to see this


But it works great when you don't need much control. Example prompt: "Free-jazz solo by tenor saxophonist, no time signature."


What other inputs besides text prompting are there for SD? Are you referring to img2img, ControlNet, etc.?


It's crazy that nobody cares. It seems to me that ML hype trends focus on denying skill and disproving creativity by denoising random noise into outputs indistinguishable from human work, and to me this whole chain of negatives hasn't proven its worth.


Generative models allow people without certain skills to be creative in forms of art that would otherwise be inaccessible to them.

With DALL-E, I can get an image of something I have in my head without investing hundreds of hours into watching Bob Ross (which I do anyway).

With audio generators, I can produce music that is in my head without learning to play an instrument or paying someone to do it. I still have to arrange it correctly, but I can put out a techno track without spending years learning the intricacies.



