Music ControlNet: Multiple Time-Varying Controls for Music Generation (musiccontrolnet.github.io)
75 points by GaggiX on Nov 14, 2023 | 7 comments



I was thinking recently: now that we have multimodal text and image models, music and sound generation will probably get rolled into the big foundation models. Then we can look at adding more niche modalities like 3D model generation. As we explore larger numbers of modalities, we will end up with highly generalized models.


Looking at the "Melody & Rhythm Control" section under "Cherry-picked"... the rhythm control is weird. In many of the examples, the generated music clearly has a different BPM from the reference, but the model still seems to align the notes in units of absolute time (rather than beats), so the notes get desynced from the beat. The model then tries to cover up the discrepancy and pass it off as syncopation by emphasizing, de-emphasizing, or outright altering notes, but it doesn't work very well.

Maybe conditioning on the BPM would help?
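
To make the desync concrete, here's a rough sketch (my own illustration, not anything from the paper): the same onset times in seconds land on different beat fractions once the tempo drifts, which is exactly what happens if the control is specified in absolute time but the generated audio runs at a different BPM.

    # Hypothetical example: onsets fixed in seconds drift off the beat grid
    # when the BPM changes.

    def seconds_to_beats(onsets_sec, bpm):
        """Convert onset times in seconds to positions on the beat grid."""
        beat_len = 60.0 / bpm              # duration of one beat in seconds
        return [t / beat_len for t in onsets_sec]

    # Reference rhythm: straight eighth notes at 120 BPM (an onset every 0.25 s)
    reference_onsets = [i * 0.25 for i in range(8)]

    print(seconds_to_beats(reference_onsets, bpm=120))  # on-grid: 0.0, 0.5, 1.0, ...
    print(seconds_to_beats(reference_onsets, bpm=100))  # off-grid: 0.0, 0.417, 0.833, ...

If the rhythm control were expressed in beats (or the model were conditioned on BPM), the second case would stay on the grid.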


The model used here is very small (41M parameters); I wonder how well it would scale to a bigger size.


I understand that this paper is about controls, but I wish there were more detail on how it differs from other music generation methods like MusicLM. That seems to be covered in the MusicGen paper though [5]!

But then I'm more curious about how this compares to MusicLM in terms of music generation.


Love seeing the MIR research from CMU recently. Chris Donahue is the man!!


Awesome, I've been curious about applying ControlNet like this. I'm glad someone tried it out.


Can I try this out somewhere?



