Well-tuned zero-shot models can work from as little as 5 seconds of audio, but the results aren't perfect. You won't capture prosody, for example.

The human voice isn't as unique as you might think, though. You can encode a lot of information about a voice in about 100Kb.
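
For a rough sense of scale, here's a back-of-envelope sketch. It assumes "100Kb" means roughly 100 kilobytes and that values are stored as 32-bit floats; the 256-dimension embedding size is a hypothetical figure picked for illustration, not something any particular model is known to use here.

```python
def floats_in_budget(budget_bytes: int, bytes_per_value: int = 4) -> int:
    """Return how many fixed-width values fit in a byte budget."""
    return budget_bytes // bytes_per_value


# Assumption: "100Kb" is read as about 100 kilobytes (100,000 bytes).
BUDGET_BYTES = 100 * 1000

# Hypothetical size of one speaker-embedding vector, for illustration only.
EMBEDDING_DIM = 256

# How many float32 values fit in the budget.
n_values = floats_in_budget(BUDGET_BYTES)

# How many whole vectors of EMBEDDING_DIM values that corresponds to.
n_embeddings = n_values // EMBEDDING_DIM

print(f"{n_values} float32 values, or {n_embeddings} vectors of {EMBEDDING_DIM} dims")
```

Under those assumptions the budget holds tens of thousands of numbers, which is a lot of room for describing a voice compared with a single compact vector.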