> based on flow matching with Diffusion Transformer Yeah that's not gonna be rea...

modeless · 2024-10-14T22:23:49.000000Z

StyleTTS2 is faster than realtime

moffkalast · 2024-10-15T10:20:31.000000Z

To be clear, what I mean by realtime is full gen under at most 200ms so it can be sent to the sound card and start playing, not generating under the amount of time it would take to play it, which would add that as an unusably long delay in practice.

I suppose it might be possible to do it with streaming very short segments, but I haven't seen any implementation with it that would allow for that, and with diffusion based models it doesn't even work conceptually either.

gunalx · 2024-10-14T22:45:41.000000Z

Bark?