Well-tuned zero-shot models can work from as little as 5 seconds of audio, but the results aren't perfect. You won't capture prosody information, for example.

The human voice isn't as unique as you might think, though. You can encode a lot of information about a voice in about 100Kb.
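
For a rough sense of scale, here is a back-of-the-envelope sketch of how many 32-bit floats fit in that budget. It assumes decimal units and covers both readings of "100Kb" (kilobits and kilobytes), since the unit is ambiguous; the float32 representation and the helper name `float32_capacity` are assumptions for illustration, not a claim about how any particular model stores a voice.

```python
# Sizing sketch: how many float32 values fit in "about 100Kb"?
# Assumptions: decimal units (1 K = 1000), 4 bytes per float32.

BYTES_PER_FLOAT32 = 4


def float32_capacity(size_bytes: int) -> int:
    """Number of whole float32 values that fit in size_bytes."""
    return size_bytes // BYTES_PER_FLOAT32


# Reading 1: 100 kilobits, converted to bytes.
kilobits_reading_bytes = 100 * 1000 // 8

# Reading 2: 100 kilobytes.
kilobytes_reading_bytes = 100 * 1000

print("as kilobits: ", float32_capacity(kilobits_reading_bytes), "floats")
print("as kilobytes:", float32_capacity(kilobytes_reading_bytes), "floats")
```

Either way, the budget is a few thousand to a few tens of thousands of numbers, which is small next to the raw audio it summarizes.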
